Wednesday, December 23, 2009

Creating a block device backed by Amazon S3

There is a lot of benefit to using Amazon S3 as a network drive: data is automatically replicated across many storage nodes, so chances are the files will survive pretty well.  However, S3 comes with several limitations: an object has to be 5GB or less, and an object can only be uploaded as a whole.  Partial download works; the lack of partial upload is a consequence of the distributed nature of the S3 storage system.

Of course, a user mode file system like s3fs over FUSE can always elect to shard a large file into smaller S3 objects, say 4MB each.  It would need to hide the individual chunk objects and present only one coherent file.  While this might seem adequate, hiding objects means that some file names are not allowed.

Another issue is that S3 objects are stored inside buckets that have no directory structure.  Properly supporting directory structure means storing directory listing meta-info as an object in the bucket as well.

What if the user wants files to be encrypted transparently?

These issues all signify that, in order to build a proper network drive on top of S3, we need to implement a lot of file system features from scratch.  It would be much easier if we could store a file on S3 and use it as a block device.  Let's go back to the drawing board and redo the sketch.

  • First, we'll have a simple sharded S3 file system over FUSE that can store large files as chunked objects.
  • We create a large file to be used as a loopback device, format it, and mount it.
    • dd if=/dev/zero of=/mnt/s3/fs.img bs=1M seek=$((device_size_in_MB - 1)) count=1
    • mkfs.ext4 /mnt/s3/fs.img
    • mount -o loop /mnt/s3/fs.img /mnt/fs
We get all features of the filesystem (ext4 in this case), including all sorts of POSIX ACL goodies, for free.  The operating system kernel even implements caching for us, so we can enjoy great performance!

This also works similarly on Mac OS X: just use Disk Utility to create a disk image.
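
If you prefer the command line on the Mac, hdiutil can do the same thing as Disk Utility; here is a rough sketch, where the image size, filesystem, volume name, and paths are just placeholder choices:
  • hdiutil create -size 10g -fs HFS+J -volname s3backup /mnt/s3/fs.dmg
  • hdiutil attach /mnt/s3/fs.dmg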

If you want encryption, no problem.

  • dd if=/dev/zero of=/mnt/s3/crypt.img bs=1M seek=$((device_size_in_MB - 1)) count=1
  • losetup /dev/loop0 /mnt/s3/crypt.img
  • cryptsetup luksFormat /dev/loop0
  • cryptsetup luksOpen /dev/loop0 crypt
  • mkfs.ext4 /dev/mapper/crypt
  • mount /dev/mapper/crypt /mnt/crypt
One problem with using S3 as a block device is that the image file size is fixed, but depending on the filesystem you use, it should be possible to resize the filesystem online.  It might even be possible to use ZFS by adding block storage to the pool.
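
For ext4, for example, the resize could look roughly like the sketch below.  This is an offline resize that assumes the image is unmounted first, with new_size_in_MB as a placeholder for the enlarged size:
  • umount /mnt/fs
  • dd if=/dev/zero of=/mnt/s3/fs.img bs=1M seek=$((new_size_in_MB - 1)) count=1 conv=notrunc
  • e2fsck -f /mnt/s3/fs.img
  • resize2fs /mnt/s3/fs.img
  • mount -o loop /mnt/s3/fs.img /mnt/fs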

If you're not happy with shipping a full 4MB chunk over your own connection for every small change, you can easily reduce the amount of data transferred.  Instead of mounting the block device image over S3 locally, do the following:
  • Build an Amazon EC2 AMI to mount the disk image on S3 as a block device.
  • Make sure ssh and rsync are installed on the machine image.
Then you can use rsync to back up your local files with only minimal incremental transfer to the Amazon EC2 cloud.  The 4MB per chunk data transfer between EC2 and S3 (within the same region) is free and likely very fast.  The instance only needs to be running when you start the backup, and it can be terminated as soon as the backup is done.
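
The backup itself could then be a one-liner along the lines of this sketch, where ec2-host stands for the instance's public hostname and /mnt/fs is wherever the instance mounted the loopback image:
  • rsync -az --delete -e ssh /home/me/ ec2-host:/mnt/fs/backup/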

Now, the only missing piece is a FUSE module for sharding large files into smaller objects on S3.  Anyone want to take that on?

Saturday, December 19, 2009

Wireless Doorbell Instant Messaging Bridge

The apartment building I live in has never had a working intercom.  It has something that resembles an intercom box, and it is supposed to dial a preset phone number when a visitor pushes the button for a tenant, but the landlord never bothered to keep the presets updated.  My visitors can call my cellphone, but I often cannot have UPS or FedEx send packages to my home.

Other tenants have the same problem.  They would leave notes repeatedly asking UPS to just leave their packages in the entrance hallway, but UPS never does that for policy reasons.  Some tenants have tried leaving a phone number, but I once ran into the UPS delivery guy, and he told me the company does not issue them corporate cellphones and he never uses his personal cellphone for work.

Although there are wireless doorbell systems that are very cheap (~$30), I don't want to install a doorbell just for my own unit.  I want the doorbell to send an instant message that can be broadcast to anyone who is interested, including other tenants of this apartment.  The question is how to build a system like that.

The first iteration of this idea involved using a wireless sensor network, but those require interfacing with a gateway, which must be connected to a computer for Internet connectivity.  I want to leave the computer out of the picture.  Besides, wireless sensor networks aren't cheap.  I forget the exact price range, but a gateway could cost $300-$600 or more, and each node is another $100-$200.  And since many of these are still academic prototypes, you don't find many places to buy them.  Someone could steal the doorbell, and it could cost me $200 to replace.  This seems like an expensive doorbell.

Then I wondered if a "system on a chip" would be a good idea.  I found the PIC Micro Web, which is really a small computer with a parallel port and an Ethernet port.  I could solder a push button to the parallel port and have it fire something off over TCP/IP on the Ethernet side.  The price is reasonable.  The only problem is that it only works over wired Ethernet.  In search of similar units, I found the Digi Connect ME, which has a wireless sibling, the Digi Connect Wi-ME.  That would theoretically allow me to connect the doorbell push button to my wireless home network and send an instant message over my DSL.  However, it's still a bit pricey at $130 per unit.

Then it occurred to me that I could use a wireless-to-Ethernet bridge that may or may not run Linux.  I found the Linksys WGA600N, which is based on a Ubicom chipset that has a Linux-based SDK.  With that device, maybe I could locate the serial port and repurpose it for the doorbell button connection.  The cost is $80, but I think that is acceptable.

Wednesday, December 16, 2009

ZFS on Amazon EC2 EBS

One of the advantages of Amazon EC2 for customers like you and me is that you can rent a computer in their data center and be charged only for what you use (computing hours, instance size, network transmission), with the exception of EBS, where you are charged for the amount of storage you provision to a volume. For example, you're still charged for 1TB of storage even if you only use 1GB of that 1TB. Since EBS is just a block device, there is no way for Amazon to tell how much of it you actually use.

What happens if you outgrow the space you initially provisioned to a volume? You can expand it like this: detach the EBS volume, make a snapshot of the volume to S3, use that snapshot to create a new EBS volume of a greater size, and attach the new volume. You will likely also need to use an operating system tool to resize the filesystem to the new volume size.
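
With the EC2 API command line tools, the whole dance looks roughly like the following sketch; the volume, snapshot, and instance IDs are placeholders, and the last step depends on which filesystem you use:
  • ec2-detach-volume vol-11111111
  • ec2-create-snapshot vol-11111111
  • ec2-create-volume --snapshot snap-22222222 --size 200 -z us-east-1a
  • ec2-attach-volume vol-33333333 -i i-44444444 -d /dev/sdf
  • resize2fs /dev/sdf   # for ext3/ext4; other filesystems have their own resize tools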

While this might work well for some people, I'm not entirely happy with this approach. The time it takes to make the snapshot is directly proportional to the amount of data you have stored, and the downtime seems unavoidable.

I've been studying ZFS pool management, and here are my findings. You can create EBS volumes and add them to a pool. Then, when you want more space, just create a new EBS volume and add it to the pool. The pool enlarges automatically to reflect the additional space.

The smallest EBS volume you can create is 1GB, but I don't suppose anyone would want to create lots of 1GB volumes; that would be a nightmare to keep track of. Fortunately, ZFS also allows us to replace smaller disks with larger disks. You can keep a pool of about 3-4 EBS volumes, and when you want more space, just create a new volume with more space and use it to replace the smallest disk in the pool. This way, only 1/3 or 1/4 of the data in the pool needs to be transferred. Furthermore, all ZFS pool operations are performed online, so there is no downtime.
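
As a sketch of what this looks like with the zpool command (the pool name and device names are placeholders for however the EBS volumes show up on the instance):
  • zpool create tank /dev/sdf /dev/sdg /dev/sdh   # initial pool of three EBS volumes
  • zpool add tank /dev/sdi                        # grow the pool by adding a fourth volume
  • zpool replace tank /dev/sdf /dev/sdj           # or replace the smallest volume with a larger one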

What if you want ZFS mirror or raidz redundancy? Unless you're one of the lucky users of Solaris Express build 117, which provides the autoexpand capability, the disks that are part of a mirror or a raidz are not automatically expanded even after all of them are replaced with larger disks. Such is the case for zfs-fuse on Linux. However, I found that a zpool export followed by a zpool import updates the size; rebooting the instance also works. So the downtime is now at most the time it takes to export and re-import the pool, or to reboot your EC2 instance, rather than the time it takes to make snapshots, which is much better.
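
In other words, something like this sketch, with tank as a placeholder pool name:
  • zpool export tank
  • zpool import tank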

The disadvantage, however, is that you can no longer snapshot the whole filesystem at once, since it now spans multiple EBS volumes.

Wednesday, December 9, 2009

Supporting Math on Google Wave

This is not a software release, but I spent a few hours plowing through the documentation on Google Wave API, and I'm just noting down how it may be plausible to support LaTeX math equations on Google Wave.

The first aspect is rendering math equations. One essentially makes a Wave Gadget that decides how to turn LaTeX into a presentable math equation. There are two possibilities: (1) use the undocumented Google Chart API, but the LaTeX equation length is extremely limited; (2) embed jsMath and use it for math rendering. The disadvantage of the jsMath approach is that it is rather heavyweight, since each gadget has to run the same jsMath initialization code separately. A third option, which I don't consider a real possibility, is to host NTS (a Java implementation of TeX) on AppEngine. The NTS implementation of the full TeX stack (TeX to DVI, DVI to image) might be considered heavyweight, but the real challenge would be to work around its reliance on a real filesystem, which makes it difficult to port to the sandboxed AppEngine environment.

Once the rendering part is done, it would be convenient to convert LaTeX code into the rendering gadget on the fly. This can be done with a Wave Robot that listens for document changes and scans for LaTeX code in the blip. I think this is relatively straightforward. However, a concern is that each change requires the robot to spend O(n) time rescanning the whole blip, so editing in Wave would become noticeably slower as the document grows. This scalability issue must be addressed; I think the current workaround is to have Wave contact the robot only sparingly, which also means robot updates are going to be slow. It may be possible to integrate everything better in the future by processing update events directly in the Wave client, without using a robot.