Saturday, June 14, 2014

Bitrot Resistance on a Single Drive

This morning I read an interesting summary on Slashdot about bitrot under HFS+. The article is a bit misleading because bitrot is not really an HFS+ problem per se; the author's qualm is that HFS+ lacks the data-integrity features of ZFS. This echoes the long-standing hope that ZFS would make its way to Mac OS X, which never happened, in part because of the NetApp v. Sun Microsystems (now Oracle) patent lawsuit claiming that the copy-on-write feature in ZFS infringes on NetApp's WAFL technology. Although the lawsuit was dismissed in 2010, it appears that Apple had lost interest by then.

But is ZFS really the silver bullet against bitrot? My argument is that, if you only put data on a single drive, bitrot should not be handled at the filesystem layer but at the block storage layer.

ZFS data integrity uses block checksumming. Rather than storing the checksum in the block itself, each block pointer is really a pair (address, checksum). This pointer structure is propagated all the way to the root, forming a Merkle tree. You can configure the filesystem to store several copies of a block, which reduces the usable capacity of a drive to \( 1 / k \), where \( k \) is the number of copies. ZFS can also manage a disk pool for you in a RAID-like configuration: RAID-Z uses parity, so the usable capacity is \( (n - 1)/n \), where \( n \) is the number of drives (which must be identically sized). Most people don't have disk arrays, so they're stuck with multiple copies at \( 1 / k \) capacity.
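Here is a minimal sketch of the checksummed-pointer idea, in the spirit of ZFS's Merkle-tree layout. The class names and the use of SHA-256 are my own illustration, not ZFS's actual on-disk format.

```python
# A sketch of checksummed block pointers forming a Merkle tree, in the
# spirit of ZFS. Names and the SHA-256 choice are illustrative assumptions.
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class BlockPointer:
    """An (address, checksum) pair: the checksum of a child block is
    stored in its parent, never in the child block itself."""
    def __init__(self, address: int, cksum: bytes):
        self.address = address
        self.cksum = cksum

class IndirectBlock:
    """An interior node whose payload is a list of child pointers, so its
    own checksum (stored one level up) covers all of its children."""
    def __init__(self, children):
        self.children = children

    def payload(self) -> bytes:
        return b"".join(c.address.to_bytes(8, "big") + c.cksum
                        for c in self.children)

def verify(storage: dict, ptr: BlockPointer) -> bool:
    """Detect bitrot in one block: re-checksum the block at ptr.address
    and compare against the checksum recorded in the parent."""
    return checksum(storage[ptr.address]) == ptr.cksum
```

Because every pointer up to the root carries a checksum of everything below it, a flipped bit anywhere in the tree is detectable; it is only repairable if an extra copy or parity exists somewhere.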

Given that the original author of the bitrot article suffered bitrot in only 23 files over a six-year period (out of what is typically millions of files on a filesystem), cutting your drive capacity to a half or a third seems a dire price to pay.

The ideal form of bitrot resistance is for the block storage itself to perform Reed-Solomon error correction. This has always been how CD-ROMs store data: physical sectors are 2352 bytes, but data sectors carry only 2048 bytes of user data; the hidden bytes are used for synchronization, addressing, and error correction.
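To make the idea concrete, here is a small demonstration of Reed-Solomon parity detecting and repairing flipped bytes. It uses the third-party reedsolo Python package purely as an example of my choosing; real CD-ROM ECC is layered and more elaborate, but the principle is the same.

```python
# Reed-Solomon in action: append parity bytes, corrupt a few bytes,
# recover the original data. Uses the third-party reedsolo package.
from reedsolo import RSCodec

rsc = RSCodec(32)                 # 32 parity bytes per 255-byte codeword
sector = bytes(range(256)) * 8    # 2048 bytes standing in for a data sector
encoded = rsc.encode(sector)      # data with parity appended per codeword

# Simulate bitrot: flip a few bytes of the stored data.
rotted = bytearray(encoded)
for i in (10, 500, 1500):
    rotted[i] ^= 0xFF

# Recent versions of reedsolo return (message, message+ecc, error positions).
decoded = rsc.decode(bytes(rotted))[0]
assert bytes(decoded) == sector   # the original 2048 bytes are recovered
```

With 32 parity bytes per 255-byte codeword, up to 16 corrupted bytes per codeword can be repaired, at a cost of 32 out of every 255 stored bytes; the CD-ROM format spends its overhead in a similar spirit, trading raw capacity for resilience.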

Note that Reed-Solomon is generally more robust the more data it covers. For hard drives (magnetic spinning disks, not SSDs), it would make sense to do Reed-Solomon over a whole cylinder, since hard drive performance is bottlenecked on seeking: moving the read heads to a particular cylinder position is a delicate operation, while reading the extra sectors in that cylinder is essentially free. The drive itself doesn't have to implement this; the extra error correction can be done by the logical volume manager, a block storage layer that sits underneath the filesystem but above the actual device.

But one complication is logical block addressing (LBA), which numbers sectors linearly; it was introduced to overcome the disk size limits of BIOS int 13h, the outdated API for accessing disk drives in the MS-DOS days. The linear numbering hides the disk geometry. The error-correcting logical block layer could be more efficient if it knew the underlying geometry; otherwise the error correction might span two physical cylinders, which is suboptimal. But that might not be too big of a deal, since moving the disk head to an adjacent cylinder is much easier than swinging it across the disk surface. The notion of a cylinder is also largely obsolete in modern disks, because outer cylinders have a bigger circumference and can hold more sectors than inner cylinders. It would suffice to treat a pseudo-cylinder as merely a span of contiguous sectors.
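As a rough sketch of what such a block layer could look like, suppose it simply carves the LBA space into fixed-size pseudo-cylinders and reserves the last few sectors of each for Reed-Solomon parity over the span. All the numbers and names below are illustrative assumptions, not taken from any real volume manager.

```python
# A hypothetical layout for an error-correcting logical block layer:
# pseudo-cylinders are fixed spans of contiguous LBAs, with parity
# sectors reserved at the end of each span.

SECTOR = 512             # bytes per sector
SPAN = 2048              # sectors per pseudo-cylinder (1 MiB of raw space)
PARITY = 64              # sectors per span reserved for RS parity
DATA = SPAN - PARITY     # sectors per span exposed to the filesystem

def logical_to_physical(lblock: int) -> int:
    """Map a filesystem-visible sector number to an on-disk LBA,
    skipping the parity sectors of every earlier pseudo-cylinder."""
    span, offset = divmod(lblock, DATA)
    return span * SPAN + offset

def parity_lbas(span: int) -> range:
    """The LBAs holding the Reed-Solomon parity for one pseudo-cylinder."""
    start = span * SPAN + DATA
    return range(start, start + PARITY)

# The filesystem sees DATA/SPAN of the raw capacity: about 97% here,
# versus 50% for two ZFS copies or 33% for three.
print(f"usable fraction: {DATA / SPAN:.3f}")
```

Whenever a pseudo-cylinder is written, the layer would recompute the parity sectors over the whole span; on a read error, it would read the span in full and repair it in place. Since the span is contiguous, both operations cost a single seek.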

To get the best performance, the filesystem might wish to use a logical block size equal to the cylinder size. Depending on the disk geometry, that block size would typically be around 1 MB. This would waste space unless the filesystem can pack small files, as Btrfs does. Therefore, it is possible to achieve bitrot resistance on a single drive without a performance impact.

Now, if you want to put data across an array of drives, then doing Reed-Solomon error correction across drives that fail independently would be ideal. Note that you don't want to do error correction across drives with correlated failure modes, e.g. (1) drives from the same manufacturing batch subject to the same wear, or (2) drives that can be destroyed by the same power failure or surge at a site.