Friday, January 19, 2018

Embedding a Binary Search Tree in an Array

Recently I started researching what would be the most efficient data structure to represent an IP address lookup table. An IP address table is useful for making routing decisions or access control, which checks whether a single IP address is a member of a range of netblocks. A netblock is represented by a base address and a prefix length, e.g. 192.168.42.0/24 represents a range of IP addresses from 192.168.42.0 through 192.168.42.255. Hash table is not suitable for this purpose because hashes are computed on exact matches. With the given example, this would require entering all 256 addresses in the range into the hash table, which is a significant blowup.

Hardware routers use Ternary Content Addressable Memory which can tell whether a specific value matches anything in the memory, after applying a bitmask. A software implementation might use a prefix tree (trie), which theoretically deduplicates common prefixes among the entries, resulting in a smaller table. However, any data structure employing pointers has the added overhead of the pointers (a machine word) and cache locality problems, which are not considered in theoretical papers.

The most compact data structure for the vast majority of lookup tables is going to be just an array of IP netblocks. For an unsorted array, lookup takes \( O(n) \) time which will be slow for a big table. If we presort the array and use binary search, the lookup takes \( O(\lg n) \) time. But binary search is a pathological case for cache locality because lookups tend to jump back and forth until it settles in to a target. Rather than a binary search which partitions candidates by two parts, it was shown that ternary search (three partition) improves performance, while quaternary search (four partition) improves little over binary search.

It had occurred to me that cache locality for binary search could be improved if we had simply rearranged the presorted array to become an embedding of binary search tree (BST), such that for node at index \( i \), its left and right children are at \( 2i + 1 \) and \( 2i + 2 \) respectively. The scheme is inspired by how binary heap is embedded in an array, but we are embedding a complete BST. The rationale is that the search always occurs left to right in the array, with the most accessed items located at the beginning of the array, which gets the most cache hits.

Here is an example showing how a presorted array with the number sequence \( 0 \dots 9 \) could be embedded.


Note that this embedding works for any presorted array. The values for \(i\) are the indices of the BST embedding, and the values for \(x\) are the indices in the presorted array. A presorted array can always be represented by a complete BST, so the array is dense. However, we can't embed a prefix tree this way because a prefix tree is sparse.

The implementation strategy is to presort the array, create a new array, and copy the items from the presorted array to the new array in the order of the BST embedding. The preprocessing only needs to be done once, and it results in the most compact representation of the lookup table. The assumption is that once the table is preprocessed, it won't need to be changed again.

I implemented the embedded BST in Go and compared its lookup performance with the builtin sort.Search() which performs binary search on a presorted array. The runs are repeated for increasing array sizes in the powers of 2 as well as stride width (both in terms of the number of machine words, not in bytes). The items to be searched are incremented by the stride width for each iteration; small stride approximates sequential lookup, whereas large stride approximates random lookup. Each of the data point is the average number after running some 2,000,000 to 100,000,000 iterations (the number of iterations varies due to the idiosyncrasies of the Go benchmark framework, not my fault; it tries to run each case for about 5 seconds).

The folly of this is that sequential lookup (smaller stride widths) may not cover a significant portion of a very large array size \(2^n\) where \(n \ge 20\), but the experiment is fair because (1) both algorithms are subject under the same condition, and (2) the array size above \(2^{20}\) already exceeds L2 cache size on the machine running the experiment, which is sufficient to demonstrate performance difference due to cache locality. The performance numbers are collected on a late 2013 model of MacBook Pro 13" whose Core i5 "Haswell" I5-4258U CPU runs at 2.4 GHz and has 32KB L1 cache and 3MB L2 cache. This machine has 8GB DDR3 ram. The program is compiled in 64-bits.


[popup figure in a new window]

You can use the drop-down menus to highlight the series for a specific algorithm (binary search, embedded BST) to see how they perform at various stride widths, or to highlight specific stride widths to compare how the two algorithms perform.

Here are some observations.
  • At small array sizes \(2^n\) where \(n \le 7\), all lookups fit in the L1 cache. The two algorithms have identical performance at all stride widths.
  • From array sizes \(2^7\) to \(2^{12}\), the L1 misses start to happen. Array size \(2^{12}\) occupies 32KiB, the size of L1 cache.
  • Array sizes \(2^{12}\) through \(2^{18}\) still fit within the L2 cache, after which the performance starts skyrocketing at the largest stride width (which simulates random access). Even so, embedded BST held up much better than binary search.
  • At small strides, embedded BST has a more predictable performance than binary search across all array sizes.
  • Embedded BST performs at least as well as or better than binary search across all array sizes and strides.
Admittedly, embedded BST is optimized for lookup. After the initial preprocessing, we don't try to insert or remove entries from the table. If a modifiable table is desired, we can relax the density of the array so that the BST only needs to be approximately balanced. This allows us to use any classical balanced BST algorithms like red-black tree or AVL tree. It may be worthwhile to investigate the cache performance of array embedded versions of these algorithms as future work.

Some IP address lookup use cases require inserting and deleting individual addresses, such as for the purpose of intrusion detection. Since they only need to match individual addresses rather than netblocks, a hash table would suffice.

Otherwise, embedded BST is a great data structure for immutable IP address range lookup tables that is initialized once and optimized for lookup. It has the most compact representation and has much improved cache locality over binary search.

Thursday, January 11, 2018

C++, the cause and solution to life's pimpl problems


Can a programming language invite programmers to write anti-patterns that makes them write more code that is also harder to follow? Certainly, it can.

PIMPL is a much hyped code pattern in C++ to separate a class's public declaration from the implementation detail by wrapping a pointer to the implementation object in the public class. This is what the user of the public class sees. The pointer is the only member of the public class.
// In xyzzy.h

class Xyzzy {
 public:
  Xyzzy();   // Constructor.
  ~Xyzzy();  // Destructor.

  void Foo();

 private:
  class Impl;  // Incomplete forward declaration to the implementation.
  Impl *pimpl_;  // Pointer to incomplete class is allowed.
};
The implementation would be written like this, hidden away from the user.
// In xyzzy.c

class Xyzzy::Impl {
  ...
  void Foo() {
    // Actual code.
  }
  ...
  // All private members are declared in the Impl class.
};

Xyzzy::Xyzzy()  // Constructor.
  : pimpl_(new Impl) {}

Xyzzy::~Xyzzy() {  // Destructor.
  delete impl_;
}

void Xyzzy::Foo() {
  impl_->Foo();
}
There are some elaborate concerns like constness preservation, use of smart pointers, copying and moving, memory allocation, and generalizing the boilerplate to a template, but I want to take a step back and look at what PIMPL accomplishes, and why it is necessary.

The main reason for using PIMPL is to reduce compilation dependencies, so that changes made to the implementation class would not force the user of the public interface to have to recompile code. Recompilation is necessary when the change breaks binary compatibility. In C++, a change could break binary compatibility for banal reasons like adding a private member to a class.

When class members are added or removed, the size of the class would change. In C++, it is the caller's responsibility to prepare memory storage for the object, regardless whether the object is stack or heap allocated. Without recompilation, the storage might be too small. Bigger storage is not a problem, but the compiler has no way of telling the old size. With PIMPL, the size of the public class is always just a single pointer to the implementation object, so it stays the same. It shifts the responsibility of storage preparation from the user to the implementation.

Another type of change that warrants recompilation is when virtual methods are added or removed, which changes the ordering of function pointers in the vtable. Without recompilation, the old code would end up calling the wrong virtual method since the index changed. PIMPL would not be able to alleviate this type of recompilation; adding or removing non-virtual methods from the public class would still require recompilation under PIMPL.

One might ask, instead of PIMPL, why not use the abstract base class with a static constructor? It would look something like this for the public interface.
// In xyzzy.h
class Xyzzy {
 public:
  static Xyzzy *New();  // Static constructor.
  virtual ~Xyzzy();  // Virtual destructor.
  virtual void Foo();

 protected:
  Xyzzy();  // Not to be constructed directly by user.
};
And the implementation:
// In xyzzy.c
class XyzzyImpl : public Xyzzy {
 public:
  void ~XyzzyImpl() override {
    ...
  }

  void Foo() override {
    ...
  }

  // Other members.
};

Xyzzy *Xyzzy::New() {
  return new XyzzyImpl;
}
The problem is that the abstract base class forces all methods to be virtual, and virtual method calls are more expensive because it's harder to do branch prediction with function pointers. The responsibility to manage storage is shifted to the static constructor, so adding or removing members to the base class shouldn't affected binary compatibility. Even so, it still requires recompilation in practice, so members should be declared in the implementation class only.

The common theme here is that any change to a class in C++ requires recompilation, period. That's because the compiler can't tell what changed, so it can't tell whether the change may or may not affect binary compatibility. Contrast this with C, where user of a struct never has to know its size, without relying on virtual methods.
// In xyzzy.h

struct xyzzy;  // Incomplete type.

struct xyzzy *new_xyzzy();              // Constructor.
void delete_xyzzy(struct xyzzy *this);  // Destructor.
void xyzzy_foo(struct xyzzy *this);     // Method foo().
Although most build systems would still recompile the translation unit that includes xyzzy.h for posterity, it doesn't have to if the changes are binary compatible. This is why an executable compiled with an older header could still be dynamically linked with a newer version of the shared library without recompilation, and it would still work.

In the end, I think any effort to reduce recompilation for C++ code is futile. C++ is inherently designed for whole-program compilation. There are other reasons, like how template instantiation requires the implementation source code to be available. But any class would require the user to know its size, and one has to go through great lengths to hide that from the user in order to ensure binary compatibility.

Clearly, PIMPL is an anti-pattern that is motivated by the way C++ is designed. It's a problem that C never has to worry about.