Wednesday, December 23, 2009

Creating a block device backed by Amazon S3

There is a lot of benefit to using Amazon S3 as a network drive: the data is automatically replicated across many storage nodes, so chances are the files will survive hardware failures pretty well.  However, S3 comes with several limitations: objects have to be 5GB or less, and an object can only be uploaded as a whole.  Partial download works, but partial upload does not, which is a consequence of the distributed nature of the S3 storage system.

Of course, a user-mode file system like s3fs over FUSE can always elect to shard a large file into smaller S3 objects, say 4MB each.  It would need to hide the individual chunk objects and present only one coherent file.  While this might seem adequate, the names used for the hidden chunk objects become reserved, which means that some file names are not allowed.
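To make the sharding idea concrete, here is a minimal sketch of how a read or write at a given byte offset could be mapped to chunk objects.  The naming scheme (`NAME.chunk.NNNNNNNN`) and the function name `chunk_keys` are made up for illustration; they are not what s3fs actually does.

```shell
#!/bin/sh
# Hypothetical layout: chunk N of file F is stored as object "F.chunk.N".
CHUNK_SIZE=$((4 * 1024 * 1024))   # 4MB per chunk, as suggested in the text

# Print the chunk object names touched by LENGTH bytes starting at OFFSET.
chunk_keys() {
    name=$1
    offset=$2
    length=$3
    [ "$length" -gt 0 ] || return 0
    first=$((offset / CHUNK_SIZE))
    last=$(((offset + length - 1) / CHUNK_SIZE))
    i=$first
    while [ "$i" -le "$last" ]; do
        printf '%s.chunk.%08d\n' "$name" "$i"
        i=$((i + 1))
    done
}

# A 2-byte write straddling the first chunk boundary touches two objects.
chunk_keys fs.img $((CHUNK_SIZE - 1)) 2
```

With a scheme like this, a user file that happens to be called `fs.img.chunk.00000000` would collide with a hidden chunk object, which is exactly the file name restriction mentioned above.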

Another issue is that S3 objects are stored inside buckets that have no directory structure.  Properly supporting directory structure means storing directory listing meta-info as an object in the bucket as well.

What if the user wants files to be encrypted transparently?

These issues all suggest that, in order to build a proper network drive on top of S3, we need to implement a lot of file system features from scratch.  It would be much easier if we could store a single file on S3 and use it as a block device.  Let's go back to the drawing board and redo the sketch.

  • First, we'll have a simple sharded S3 file system over FUSE that can store large files as chunked objects.
  • Then we create a large file on it to be used as a loopback device, format it, and mount it.
    • dd if=/dev/zero of=/mnt/s3/fs.img bs=1M seek=$((device_size_in_MB - 1)) count=1
    • mkfs.ext4 /mnt/s3/fs.img
    • mount -o loop /mnt/s3/fs.img /mnt/fs
We get all features of the filesystem (ext4 in this case), including all sorts of POSIX ACL goodies, for free.  The operating system kernel even implements caching for us, so we can enjoy great performance!
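Note that the dd command only writes the last megabyte of the image, so the result is a sparse file.  Here is a small sketch that shows this on a local temporary file standing in for /mnt/s3/fs.img, with an assumed size of 64MB.  Whether a sharded S3 file system would also skip uploading the never-written chunks depends on how it is implemented, so treat that part as an assumption.

```shell
#!/bin/sh
# Illustration only: a local temp file stands in for /mnt/s3/fs.img.
device_size_in_MB=64                  # assumed size for the demo
img=$(mktemp)

# Seek past the first 63MB and write a single 1MB block of zeros.
dd if=/dev/zero of="$img" bs=1M seek=$((device_size_in_MB - 1)) count=1 2>/dev/null

apparent=$(wc -c < "$img" | tr -d ' ')    # logical size in bytes
allocated_kb=$(du -k "$img" | cut -f1)    # space actually allocated, in KB

echo "apparent size: $apparent bytes"
echo "allocated:     $allocated_kb KB"
rm -f "$img"
```

On a local file system that supports sparse files, the allocated figure should come out far smaller than the apparent size.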

This also works similarly on Mac OS X: just use Disk Utility to create a disk image.

If you want encryption, no problem.

  • dd if=/dev/zero of=/mnt/s3/crypt.img bs=1M seek=$((device_size_in_MB - 1)) count=1
  • losetup /dev/loop0 /mnt/s3/crypt.img
  • cryptsetup luksFormat /dev/loop0
  • cryptsetup luksOpen /dev/loop0 s3crypt   # s3crypt is an arbitrary mapping name
  • mkfs.ext4 /dev/mapper/s3crypt
  • mount /dev/mapper/s3crypt /mnt/crypt
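When you are done with the encrypted volume, the layers have to come off in the reverse order.  Here is a rough sketch, assuming the LUKS mapping was opened under the name s3crypt (any name would do) on /dev/loop0.  The `run` helper is my own addition: it only prints the commands unless DRY_RUN=0 is set, so you can read what would happen first.

```shell
#!/bin/sh
# Names below are assumptions for illustration; adjust to your setup.
MAPPING=s3crypt          # hypothetical device-mapper name given at luksOpen
LOOPDEV=/dev/loop0
MOUNTPOINT=/mnt/crypt

# Echo the command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

teardown() {
    run umount "$MOUNTPOINT" &&              # flush and detach the filesystem
    run cryptsetup luksClose "$MAPPING" &&   # remove the decrypted mapping
    run losetup -d "$LOOPDEV"                # release the loop device
}

teardown
```

Run it as root with DRY_RUN=0 to actually tear things down.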
One problem with using S3 as a block device is that the image file size is fixed.  However, depending on the filesystem you use, it should be possible to grow the image file and then resize the filesystem online.  It might even be possible to use ZFS and grow the pool by adding more image files as block storage.
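As a rough illustration of growing the image, the sketch below extends a sparse file in place and checks that the existing bytes survive.  It uses a local temporary file and made-up sizes (64MB to 128MB); the loop device and filesystem steps need root, so they are only shown as comments, and I have not tried them on an S3-backed image.

```shell
#!/bin/sh
# Illustration only: a local temp file stands in for the image on S3.
img=$(mktemp)
truncate -s 64M "$img"                 # original (sparse) image size
printf 'existing data' | dd of="$img" conv=notrunc 2>/dev/null

truncate -s 128M "$img"                # grow the image; existing bytes are kept
new_size=$(wc -c < "$img" | tr -d ' ')
head_bytes=$(head -c 13 "$img")
echo "new size: $new_size bytes, start of image: $head_bytes"
rm -f "$img"

# On the real setup (as root, image attached to /dev/loop0), the follow-up
# would be along these lines:
#   losetup -c /dev/loop0      # make the loop device re-read the file size
#   resize2fs /dev/loop0       # grow ext4 to fill the device
```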

If you're not happy that every small change has to be transferred as a whole 4MB chunk, you can easily reduce the transfer amount.  Instead of mounting the block device image over S3 on your local machine, do the following:
  • Build an Amazon EC2 AMI to mount the disk image on S3 as a block device.
  • Make sure ssh and rsync are installed on the machine image.
Then you can use rsync over ssh to back up your local files to the EC2 instance, with only minimal incremental transfer from your machine to the Amazon EC2 cloud.  The 4MB per chunk data transfer between EC2 and S3 (within the same region) is free and likely very fast.  The instance itself only needs to be running when you start the backup, and can be terminated as soon as the backup is done.

Now, the only missing piece is a FUSE module for sharding large files into smaller objects on S3.  Anyone want to take that on?
