Sunday, March 21, 2010

Ordered Intervals

Although we have efficient data structures to express sets, such as balanced binary search tree and sparse hash table, some sets are most efficiently expressed as an interval. An example is the character set, which is typically used for regular expression parsing, e.g. the set [0-9A-Fa-f] denotes characters for hexadecimal number digits. Historically, ASCII only has 127 characters, so a direct-index table would suffice; the table could perhaps be bitmapped, where each bit represents one code point, for a total of 16 bytes in the table. The state transition table for finite state automata representing a regular expression will only use a couple of hundred bytes. However, supporting regular expression parsing for Unicode is more challenging. Unicode currently allocates 20-bits per character (Unicode 5.0), and a typical set could easily contain several dozen code points, so a direct-mapped table certainly will not work well (one bitmapped row in the state transition table would be 128KB). A plain old list of character interval would work much better.

For the purpose of discussion, let's forget about character sets.

Assume we have a set expressed by a set of intervals, so we determine if a value is a member of the set by examine if the value lies inside any of these intervals. Set union is straightforwardly defined by union of the set of intervals. We'll focus on set membership for now and ignore set intersection. If intervals are arranged in a list, searching still takes linear time. In order to allow binary searching in logarithm time, we need a way to order these intervals.

Ordering is tricky if intervals overlap. Fortunately, overlapping intervals can be trivially combined, e.g. [5, 10] ∪ [7, 12] = [5, 12]. However, there is still a corner case. How can we tell if the interval [5, 9] ∪ [10, 14] equals to [5, 14] or not? It depends. We can combine them if the intervals range over natural numbers; not if the intervals range over real numbers. However, we shouldn't make any assumption about the continuity of elements. In order to avoid corner case like this, we make one side of the interval an open interval. It is customary in computer science to use intervals like [a, b), where a is inclusive, and b is exclusive, because for-loops are usually written this way.

We'll have a rule saying [a, b) ∪ [c, d) = [a, d) if b ≥ c. In particular, [a, b) ∪ [b, c) = [a, c).

The rest of interval tree implementation can be seen in Introduction of Algorithms, pages 311-315.

No comments: