Tuesday, December 1, 2015

Recursively Extensible Open Addressing Hash Table

There are two common ways to implement a hash table: open addressing and separate chaining. Both use a hash function to map keys to indices within an array for fast lookup. They differ in the conflict resolution strategy when multiple keys map to the same index. Upon conflict, open addressing searches for the next available index in the array through a probing policy such as linear probing (increment index by the number of probing attempts), quadratic probing (add index by the number of probing attempts squared), or by enumerating through different hash functions. Separate chaining avoids conflict by maintaining keys of the same index as a linked list.
Open Addressing
Separate Chaining
Both tables can be subject to O(n) lookup for a naive implementation when there are many collisions. A better one would dynamically resize the array depending on the load factor, but resizing is an expensive operation. For separate chaining, performance degradation can also be mitigated by using a balanced binary search tree instead of a linked list. Since the worst case access time is bound to O(lg n), it can retain some efficiency without resizing.

So far, this is text book material prior art.

In terms of cache efficiency, separate chaining has the disadvantage that list nodes may be scattered, so accessing the list can cause many cache misses. Open addressing is more cache friendly because the key and values are stored in a contiguous array. The question is: can we do away without resizing like separate chaining, but enjoy the cache efficiency of open addressing?

Yes we can. Here I introduce the recursively extensible open addressing hash table.
A recursively extensible open addressing hash table keeps the key/value in the table, but also reserves a pointer to an extension subtable per entry. To resolve conflict, it would first probe within the same table for an available slot but limited to k tries with a small k. If the search fails, then it would descend to the extension subtable at the original hashed index, creating the subtable only if necessary. It would use an orthogonal hash function to compute the index within this subtable, repeat the probing, and descend if necessary.

An orthogonal hash function is linearly independent (in the linear algebra sense) to the hash functions used in parent tables. An easy way to compute the orthogonal hash is to consider a universal hash function producing a 32-bit unsigned integer output. The first level hash table has 256 entries, and the index is based on the value in the 8 least significant bits. The second level tables also have 256 entries, with index drawn from the next 8 bits, and so on for the third and fourth levels.

As for the probing policy, we could use any policy like any open addressing hash table would, but linear probing has the best cache locality, and it is sufficient to achieve high load factor without compromising lookup performance. That is because the subtables all enjoy O(1) access as well.