Saturday, May 19, 2012

Locality profile of a disk drive

When I was an undergraduate, I had at least one professor who told his students that we're told lies in order to be told the truth. The lies are simplification of the truth so that we could understand the point the professor wants to get across.

In the case of disk performance, I've never been told the truth, and I never knew it until I found this paper, “An introduction to disk drive modeling.”

We were taught that seek time is the dominant factor of access latency, and that “average seek time” is proportional to 1/3 of the number of tracks. The truth is much more complicated.
  • A seek consists of four phases, speedup (accelerating the head), coast (head travels at constant velocity), slowdown (deceleration of the head), and settle (the head aligns itself to the track). The way shouting causes access latency to spike is probably due to interfering in the settling phase.
  • There is more surface area towards the outer tracks than the inner tracks, and assuming that the disk is to have constant linear density in both x-y dimensions, the angular and radius differential will be non-linear. As thus, we could have either:
    • Constantly spaced tracks where inner tracks have fewer sectors and outer tracks have more sectors, or
    • Tracks with constant number of sectors, where inner tracks are further apart and outer tracks are closer, or
    • A hybrid model that is somewhere in between, depending on which geometry gives the better seek time.
Also, most disk drives nowadays expose a linear addressed sector scheme as opposed to CHS (cylinder, head, sector) which is specific to disk geometry. The linear addressing hides away the complex geometry that is disk specific, and all we care about nowadays is the seek time profile from one linear addressed sector to the other sectors.

Quite interestingly, where as in the case of random accessed memory, temporal locality prevails spatial locality, it is the other way with disk drives. For random accessed memory, spatial locality at worst will cause n times performance degradation where n is the size of the cache line divided by the word size, a lack of temporal locality can often cause a total breakdown of the memory hierarchy. For disk access, the lack of temporal locality means that a group of n read/write operations that could previously be grouped together in one single seek sweep would have to be performed as different seeks back and forth, it is the lack of spatial locality of the on-disk data arrangement that causes the seek time to soar.

As part of a disk-based filesystem design, it should allow files accessed together to belong to a (spatial) locality group.

There are already databases that store columns into locality groups of columns, but that sort of locality group only affects the “virtual” linear address within a file, which by virtue of the filesystem may be fragmented throughout the disk.