[SGVLUG] Cluster Filesystems
Max Clark
max at clarksys.com
Sun Jan 8 10:03:55 PST 2006
Chris,
NetApp does support multiple-head and device redundancy via a cluster
software option - this is just not economically feasible for many people.
For our customers with six-figure storage budgets we can sell a wide
selection of options, all with exceptional redundancy and failover - I am
looking for something in the sub-$10,000 range.
A series of ATA/SATA disks in a 4U chassis w/ a Raidcore or 3ware
controller works well for mass storage requirements - however the recovery
time after a disk failure is way too painful (a disk failure on a 4TB
array w/ 500GB drives will take most of the day to rebuild).
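Back-of-the-envelope (the sustained rebuild rate below is just my
assumption for a busy SATA array), the rebuild of a single replaced 500GB
drive alone works out to something like:

    # rough rebuild-time estimate; the 25 MB/s rate is an assumption,
    # arrays still serving I/O can be much slower
    drive_gb = 500
    rebuild_mb_per_sec = 25
    hours = drive_gb * 1024 / rebuild_mb_per_sec / 3600.0
    print("%.1f hours" % hours)   # ~5.7 hours, before any controller overhead

and that is before the controller throttles the rebuild to keep serving
requests, which is how you end up losing most of the day.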
I am looking to build a true COTS-based system with each individual node
having between 1 and 4 hard drives. I want the data to be replicated
across the nodes in the storage cluster so that if one of the nodes
fails the storage is still accessible. I know with CIFS/NFS this does
not equate to automatic failover for the client - but the outage window
to reconnect to a different node is acceptable as long as the storage is
still accessible. While all of the storage nodes will be running Linux,
the clients will be a mix of Linux, Windows, Sun, and OS X - hence the
need for NFS/CIFS.
I know this is possible because Isilon has implemented this system with
a proprietary OS/Filesystem - I just need to know if this functionality
(of the filesystem) exists within the Open Source world.
Thanks,
Max
--
Max Clark
max [at] clarksys.com
http://www.clarksys.com
Chris Smith wrote:
> On 1/7/06, Max Clark <max at clarksys.com> wrote:
>> A recent failure of a customer's NetApp has again left me looking for a
>> better approach to network storage - specifically in the redundancy and
>> replication model. For the sites that can afford it we recommend and
>> sell the Isilon systems which give what I am looking for... multiple
>> nodes striped together to provide a distributed pool of storage that can
>> survive a node failure.
>
> I was pretty sure that NetApp had a way to approximate a functional
> high availability system (something where one node would take over
> the IP of a failed node). It isn't perfect, but it is functional.
>
>> Ideally I'd love to run the Google File System
>> (http://labs.google.com/papers/gfs.html) but I don't think they would
>> give me a copy.
>
> The Google File System would probably not work too well for you
> anyway. It isn't a proper POSIX file system (really, it's an API for
> managing data). It's optimised for a specific problem domain. It makes
> assumptions that are unlikely to be true in the general case (perhaps
> true in your case), like that files are mostly quite large, that one
> doesn't ever need to write to a file with anything other than an
> append, etc.
>
> That said, if you are looking for something like it, you can look at
> the code in the nutch project:
>
> http://lucene.apache.org/nutch/
>
> They have implemented their own data management system, which is based
> on principles similar to the Google File System.
>
>> Which leaves me with AFS and CODA. Can anyone give me
>> real world examples/tips/tricks with these systems?
>
> Don't use CODA for anything serious. AFS is a nice file system, but
> it's not really a cluster filesystem. It does cope somewhat better in
> the event of a server failure, but it is really just a network
> filesystem.
>
>> I would like to be able to take a bunch of 1U single CPU machines with
>> dual 250GB-500GB hard drives and cluster them together as a single NAS
> system supporting NFS/CIFS clients. I figure I should be able to get
> 0.2TB of usable protected storage into a node for ~$800/ea; this would
> mean $5,600 for 1TB of protected storage (assuming parity and n+1).
>
> May I ask why you want to use multiple machines if you're still going
> to present an NFS/CIFS interface? In general, clustered filesystems
> really only make sense if the clients access them via their native
> interface.
>
> If you think about it, a single 4U machine with a nice RAID storage
> system would cover this. Heck, with SATA drives you can actually get
> that kind of storage out of a 1U (4 400GB drives with RAID-5 and you've
> got 1.2TB of storage) and at a bargain basement price (although without
> the same kind of high transaction rate performance you'd expect from
> higher end drives). While not a super high availability system, it'd
> have as good availability as what you are envisioning.
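> As a quick sanity check on that capacity figure - RAID-5 gives you n-1
> drives' worth of usable space, so:
>
>     drives = 4
>     drive_gb = 400
>     usable_gb = (drives - 1) * drive_gb   # one drive's worth goes to parity
>     print(usable_gb)                      # 1200 GB, i.e. the 1.2TB above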
>
> If you are doing NFS/CIFS, you just aren't going to get the kind of
> redundancy you are talking about. If a client is talking to an
> NFS/CIFS server when it dies, there is going to be a service
> interruption (although particularly with UDP NFS you can do some
> clever things to provide a fairly smooth transition). Probably the
> simplest way to do that is to have a designated master which serves an
> NFS/CIFS interface and then use Linux's network block device to RAID
> together the drives on all the other machines.
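> Very roughly, and only as a sketch (the hostnames, port, and device
> names below are made up; each storage node is assumed to be running
> nbd-server against its local disk), the master's side of that could
> look something like:
>
>     import subprocess
>
>     STORAGE_NODES = ["node1", "node2", "node3", "node4"]  # hypothetical hosts
>     NBD_PORT = 2000                                        # hypothetical port
>
>     def run(cmd):
>         print(" ".join(cmd))
>         subprocess.check_call(cmd)
>
>     # attach each node's exported disk as a local block device
>     nbd_devices = []
>     for i, host in enumerate(STORAGE_NODES):
>         dev = "/dev/nbd%d" % i
>         run(["nbd-client", host, str(NBD_PORT), dev])
>         nbd_devices.append(dev)
>
>     # software RAID-5 across the remote block devices (one node's worth
>     # of capacity goes to parity, so any single node can die)
>     run(["mdadm", "--create", "/dev/md0", "--level=5",
>          "--raid-devices=%d" % len(nbd_devices)] + nbd_devices)
>
>     # from here: mkfs /dev/md0, mount it, and export it over NFS/Samba
>
> The master is still a single point of failure for serving, of course -
> the point is just that the data survives any single storage node dying.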
>
>> Thoughts and opinions would be very welcome.
>
> You probably are looking for a clustered file system. The ones that
> come to mind immediately are Lustre, SGI's CXFS, Red Hat's GFS, OCFS,
> PVFS, and PVFS2 (there are others).
>
> We are experimenting with Lustre at Yahoo Research, and I can say that
> the early results show just amazing performance, although you do need
> a really nice switch to sustain it. The down side of Lustre is that it
> only supports Linux clients. CXFS has a fairly broad set of client
> drivers, but I don't know of a Windows client. Same for GFS really. I
> think only PVFS has one -- maybe OCFS has one, but I never looked. PVFS
> and PVFS2 are more geared towards massively parallel computer systems
> (to a certain extent all of the ones I mentioned are, but PVFS and
> PVFS2 especially so), so unless you are working on that kind of system
> you are probably better off with a more general purpose clustered
> filesystem.
>
> --
> Chris
>