More details for the Parcel of Penguins

January 9th, 2004 (11:03 pm)
current mood: driven
current song: Kate Bush - Rubberband Girl

I've been thinking some more about my Parcel of Penguins project. The big question to answer has been, what sort of model does it present to the user? There are three basic approaches to take:

  1. The cluster acts like a big disk drive, a bucket o' bits that the OS can format however it wants. This seems to be the approach taken by most commercial storage products today; they accept commands over the network to read and write individual blocks on the disks. The upside is that the OS can do whatever it wants with it; the downside, as I understand it, is that only one client machine can access it at a time, because filesystems are designed with the assumption that the disk drive is attached to only one machine (usually reasonable). That makes this approach problematic, because one of the great things to do with a storage cluster would be to serve files up to a computational cluster, which means lots and lots of clients.
  2. The cluster acts like a big Web server, letting clients store and retrieve whole files. This would be really simple, but, in order to work with a file, you'd have to copy it down to your local disk. It would be highly inefficient for making a small change to a large file. Even worse, for things like video editing, it would mean you'd have to have space on your local drive for all the files you wanted to edit, which could be enormous; the cluster would basically become a backup drive. Since video editing is one of the significant applications driving the need for large amounts of disk space, this would be a Bad Thing.
  3. The cluster acts like a big file server, letting clients read and write files as freely as if they were on the local drive. I think this is the best approach: it lets applications use the cluster directly, without having to be rewritten to store and retrieve files there, and it's friendly for multiple clients. And, frankly, multiple clients are where this cluster has its best chance to shine; they can take advantage of the fact that there are, in effect, multiple paths out of the cluster (provided the clients and the cluster nodes are on the same Ethernet switch), and get better total performance than a single client could.

So, I think my best bet is to modify the Linux NFS implementation. The server will be extended so that, when one server gets a write request, it mirrors it to its peers (with a transaction log). The client will be extended to know about multiple servers.

Some terminology:

cluster
The entire set of machines that work together to serve up files. The cluster is made up of one or more volumes; each file is stored on one volume.
volume
Each file is stored in one volume. A volume is made up of one or more devices; each file on a volume is stored as a separate copy on each device.
device
An individual disk drive. Every device in a volume has the same contents. (Actually, a device might not correspond to a whole drive; it might be a partition, or even just a directory. It'd depend on how much hardware you could afford, really. For my initial implementation, I might well wind up with just directories.)
hash function
The algorithm that specifies which volume holds a given file; it takes the file pathname as input and returns the volume (name, address, whatever). I keep going back and forth over whether the hash function should be executed on the client. (The alternative is to have it run only on the server, and have the client query the server each time it needs a hash.) Running on the client makes for better latency, but complicates configuration, because it requires the hash function to be downloaded. (This is easy if it's a fixed function with simple parameters, and difficult if it can be arbitrary code.) Querying the server is easier; the servers will all have to be configured to know the hash function anyway, so they can refuse to serve files that aren't assigned to them. I suspect my initial implementation will cheat and put it on the client with static configuration.
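
To make that static-configuration cheat concrete, here's a rough sketch in Python. The real implementation would live in the kernel NFS code; the volume table, the addresses, and the choice of MD5 here are just made-up examples.

    import hashlib

    # Hypothetical static configuration: volume name -> addresses of the
    # devices that mirror it. These names and exports are invented.
    VOLUMES = {
        "vol0": ["10.0.0.10:/export/vol0", "10.0.0.11:/export/vol0"],
        "vol1": ["10.0.0.12:/export/vol1", "10.0.0.13:/export/vol1"],
    }

    def volume_for_path(pathname):
        """Hash the pathname and map it onto one of the configured volumes."""
        digest = hashlib.md5(pathname.encode("utf-8")).digest()
        names = sorted(VOLUMES)
        return names[int.from_bytes(digest[:4], "big") % len(names)]

    def devices_for_path(pathname):
        """All the mirrored devices a client could contact for this file."""
        return VOLUMES[volume_for_path(pathname)]

    print(volume_for_path("/video/raw/take-07.dv"))
    print(devices_for_path("/video/raw/take-07.dv"))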

So, when a client wants to read or write a file, it will check the hash function, find the appropriate volume, then pick a device to talk to. In some scenarios, a sophisticated client might even talk to multiple devices. For example, suppose the cluster were set up with 100Mbps Ethernet, but the client, on the same switch, had Gigabit Ethernet. The client could read odd-numbered blocks from device A and even-numbered blocks from device B, getting an effective bandwidth of 200Mbps.
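
Here's a back-of-the-envelope sketch of that striped read in Python, with a dummy read_block() standing in for whatever per-device read call the real client would make:

    from concurrent.futures import ThreadPoolExecutor

    BLOCK_SIZE = 32 * 1024

    def read_block(device, pathname, block_no):
        # Stand-in for an NFS READ of one block from one device; it just
        # returns zeroes so the sketch runs.
        return b"\0" * BLOCK_SIZE

    def striped_read(devices, pathname, n_blocks):
        """Fetch blocks round-robin across the volume's devices, in parallel,
        so the client can use more bandwidth than any one device has."""
        with ThreadPoolExecutor(max_workers=len(devices)) as pool:
            futures = [
                pool.submit(read_block, devices[i % len(devices)], pathname, i)
                for i in range(n_blocks)
            ]
            return b"".join(f.result() for f in futures)

    data = striped_read(["10.0.0.10", "10.0.0.11"], "/video/raw/take-07.dv", 8)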

Keeping the devices in a volume synchronized is the hard part. I think the flow has got to be something like this:

  • Client sends write request to device A.
  • Device A writes data to file, and logs the transaction.
  • Device A forwards write request to devices B and C in the same volume.
  • Device B writes data to file, logs transaction, acks A.
  • Device C writes data to file, logs transaction, acks A.
  • After last ack comes in, A acks client, sends close msg to B and C, deletes the transaction from the log.
  • B gets close msg, deletes transaction from its log.
  • C gets close msg, deletes transaction from its log.
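
To convince myself the flow hangs together, here's a toy single-process sketch of it in Python. Device, handle_write, and so on are names I've made up for the example; the real thing would be RPCs between the servers.

    class Device:
        """One mirror in a volume, with its files and its transaction log."""
        def __init__(self, name):
            self.name = name
            self.files = {}   # pathname -> bytearray
            self.log = {}     # transaction id -> (pathname, offset, data)

        def apply_write(self, txn, pathname, offset, data):
            buf = self.files.setdefault(pathname, bytearray())
            if len(buf) < offset + len(data):
                buf.extend(b"\0" * (offset + len(data) - len(buf)))
            buf[offset:offset + len(data)] = data
            self.log[txn] = (pathname, offset, data)   # keep until closed
            return True                                # the ack

        def close_txn(self, txn):
            self.log.pop(txn, None)

    def handle_write(primary, peers, txn, pathname, offset, data):
        """The flow above: primary writes and logs, forwards to peers,
        acks the client, and closes the transaction once all peers ack."""
        primary.apply_write(txn, pathname, offset, data)
        acked = [p for p in peers if p.apply_write(txn, pathname, offset, data)]
        if len(acked) == len(peers):
            for dev in [primary] + peers:
                dev.close_txn(txn)        # the close msg
        # Either way the client gets its ack; if a peer was down, the
        # transaction stays in the logs for it to catch up on later.
        return True

    a, b, c = Device("A"), Device("B"), Device("C")
    handle_write(a, [b, c], txn=1, pathname="/video/take-07.dv",
                 offset=0, data=b"frame data")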

If, say, C does not respond to the write request, then A assumes C is down. It acks the client, but does not close the transaction. Now, when C comes back up, it will consult A and B for transactions it missed, read the data from them, and update its copy of the files.
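
And a sketch of that catch-up step, reusing the Device class from the sketch above: the rejoining device asks a surviving peer for the transactions still open in its log and replays the ones it missed.

    def catch_up(rejoining, peer):
        """Replay the open transactions a rejoining device missed while down."""
        for txn in sorted(peer.log):
            if txn not in rejoining.log:
                pathname, offset, data = peer.log[txn]
                rejoining.apply_write(txn, pathname, offset, data)
        # Once the rejoining device has caught up, the peers could finally
        # close those transactions; I've left that part out of the sketch.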

There are still questions to resolve. Does A need to issue a lock, so that B and C don't accept overlapping write requests, and wind up with data out of sync? Probably.