Hardware routers use Ternary Content Addressable Memory which can tell whether a specific value matches anything in the memory, after applying a bitmask. A software implementation might use a prefix tree (trie), which theoretically deduplicates common prefixes among the entries, resulting in a smaller table. However, any data structure employing pointers has the added overhead of the pointers (a machine word) and cache locality problems, which are not considered in theoretical papers.

The most compact data structure for the vast majority of lookup tables is going to be just an array of IP netblocks. For an unsorted array, lookup takes \( O(n) \) time which will be slow for a big table. If we presort the array and use binary search, the lookup takes \( O(\lg n) \) time. But binary search is a pathological case for cache locality because lookups tend to jump back and forth until it settles in to a target. Rather than a binary search which partitions candidates by two parts, it was shown that ternary search (three partition) improves performance, while quaternary search (four partition) improves little over binary search.

It had occurred to me that cache locality for binary search could be improved if we had simply rearranged the presorted array to become an embedding of binary search tree (BST), such that for node at index \( i \), its left and right children are at \( 2i + 1 \) and \( 2i + 2 \) respectively. The scheme is inspired by how binary heap is embedded in an array, but we are embedding a complete BST.

**The rationale is that the search always occurs left to right in the array, with the most accessed items located at the beginning of the array, which gets the most cache hits**.

Here is an example showing how a presorted array with the number sequence \( 0 \dots 9 \) could be embedded.

Note that this embedding works for any presorted array. The values for \(i\) are the indices of the BST embedding, and the values for \(x\) are the indices in the presorted array. A presorted array can always be represented by a complete BST, so the array is dense. However, we can't embed a prefix tree this way because a prefix tree is sparse.

The implementation strategy is to presort the array, create a new array, and copy the items from the presorted array to the new array in the order of the BST embedding. The preprocessing only needs to be done once, and it results in the most compact representation of the lookup table. The assumption is that once the table is preprocessed, it won't need to be changed again.

I implemented the embedded BST in Go and compared its lookup performance with the builtin sort.Search() which performs binary search on a presorted array. The runs are repeated for increasing array sizes in the powers of 2 as well as stride width (both in terms of the number of machine words, not in bytes). The items to be searched are incremented by the stride width for each iteration; small stride approximates sequential lookup, whereas large stride approximates random lookup. Each of the data point is the average number after running some 2,000,000 to 100,000,000 iterations (the number of iterations varies due to the idiosyncrasies of the Go benchmark framework, not my fault; it tries to run each case for about 5 seconds).

The folly of this is that sequential lookup (smaller stride widths) may not cover a significant portion of a very large array size \(2^n\) where \(n \ge 20\), but the experiment is fair because (1) both algorithms are subject under the same condition, and (2) the array size above \(2^{20}\) already exceeds L2 cache size on the machine running the experiment, which is sufficient to demonstrate performance difference due to cache locality. The performance numbers are collected on a late 2013 model of MacBook Pro 13" whose Core i5 "Haswell" I5-4258U CPU runs at 2.4 GHz and has 32KB L1 cache and 3MB L2 cache. This machine has 8GB DDR3 ram. The program is compiled in 64-bits.

[popup figure in a new window]

You can use the drop-down menus to highlight the series for a specific algorithm (binary search, embedded BST) to see how they perform at various stride widths, or to highlight specific stride widths to compare how the two algorithms perform.

Here are some observations.

- At small array sizes \(2^n\) where \(n \le 7\), all lookups fit in the L1 cache. The two algorithms have identical performance at all stride widths.
- From array sizes \(2^7\) to \(2^{12}\), the L1 misses start to happen. Array size \(2^{12}\) occupies 32KiB, the size of L1 cache.
- Array sizes \(2^{12}\) through \(2^{18}\) still fit within the L2 cache, after which the performance starts skyrocketing at the largest stride width (which simulates random access). Even so, embedded BST held up much better than binary search.
- At small strides, embedded BST has a more predictable performance than binary search across all array sizes.
- Embedded BST performs at least as well as or better than binary search across all array sizes and strides.

Some IP address lookup use cases require inserting and deleting individual addresses, such as for the purpose of intrusion detection. Since they only need to match individual addresses rather than netblocks, a hash table would suffice.

Otherwise, embedded BST is a great data structure for immutable IP address range lookup tables that is initialized once and optimized for lookup. It has the most compact representation and has much improved cache locality over binary search.

## No comments:

Post a Comment