Monday, June 21, 2010

C++ STL std::vector capacity() Not A Substitute For size()

When you use the std::vector<> template in C++, one thing you notice is that the actual capacity(), the number of elements allocated for the internal array, may be different than size(), which is the number of items currently stored by the user. In general, size() <= capacity(). The capacity() can change when reserve() is called, and reserve() can be called by insert() or resize().

When accessing the vector using operator[], the index is not checked against valid bounds. The at() accessor defines valid bounds to be 0 <= i < size(), where i is the argument to at(), and the domain applies to operator[] as well. If you don't check for bounds explicitly, it should at least be an invariant of your code.

It is worthy to note that when size() <= i < capacity(), although your program technically does not access illegal memory (memory checker like Valgrind would not be able to detect it), the elements at those i's are strictly undefined, for the following reasons.
  • When the next resize() expands the size(), it will attempt to overwrite the elements in the internal array from the previous size() up to the new size().
  • insert(end(), ...) will overwrite the elements in the internal array after size().
  • Even if you avoid using resize() and never uses insert(), and rely solely on reserve() and operator[], there could still be a problem when reserve() causes reallocation of the internal array: only up to size() elements are copied from the original array to the new array.
In other words, the elements in the internal array from size() through capacity() are ignored by the vector and assumed to be non-existent. A symptom is that elements with values previously stored by operator[] gets erased to the default, or some other value specified for resize(). It is a bad idea to use capacity() for bounds checking. The proper bounds are defined by size().

Thursday, June 17, 2010

Lack of Lexical Context Closure Impediment to Abstraction

Lexical context (also called the environment) describes what identifiers need to be present as well as their types in order for a fragment of a program to run. Closure allows a first-class function to refer to its lexical context while being called in the lexical context elsewhere. This is useful for writing algorithms as a higher-order function whose behavior can be customized by a function argument. This is widely understood by functional programming community (LISP/Scheme, ML, Haskell) to be a very powerful abstraction tool.

It is only starting to be noticed by the C++ community. In C++, the std::for_each algorithm can be used to apply a function to elements in a container between two iterators. But only in C++0x can you use it without too much notational overhead.
// From Wikipedia: C++0x
std::vector<int> someList;
int total = 0;
std::for_each(someList.begin(), someList.end(), 
  [&total](int x) { total += x; }
);
std::cout << total;
The idea is that std::for_each would describe a specific kind of for-loop, and its usage makes the intent of the program more obvious. Note that the function argument (in bold) would be called inside std::for_each, which is in a different lexical scope than where the function argument is defined.

When compiling to machine code, a typical way to implement closure is to enclose the environment in a struct, let each function take a pointer to the environment as an implicit argument, and make the function pointer a pair of pointer to the code as well as pointer to the argument. In plain C, it would look like this:
typedef struct env {
  int total;
} env_t;

void anonymous_func(void *ctx, int x) {
  env_t *envp = (env_t *) ctx;
  envp->total += x;
}

typedef struct for_each_int_arg {
  void (*code)(void *, int);
  void *ctx;
} for_each_int_arg_t;

void for_each_int(int *first, int *last, for_each_int_arg_t f) {
  for ( ; first != last; first++)
    f.code(f.ctx, *first);
}

void sum(int *numbers, size_t n) {
  env_t env = { .total = 0 };
  for_each_int_arg func = {
    .code = anonymous_func,
    .ctx = (void *) &env,
  };

  for_each_int(numbers, numbers + n, func);
}
In fact, this technique is prevalent in C programs that take a callback in the form of a function pointer plus a context pointer. A profound implication for callbacks that do not take a context pointer is that the function argument to these higher-order functions can only refer to the global or static variable scope as its lexical context. Unless these functions do not effect external state (i.e. pure, effectless), this limitation makes it hard to encapsulate the state of the program into small reusable discrete units, and therefore a violation of abstraction.

Let's take a look at several C library calls and see what the implication means.
  • qsort(), bsearch(): the compar argument does not take context. It may be fine because compar is supposed to be pure, i.e. causing no side effects. However, neither qsort() nor bsearch() are very generic either; they assume the data to be laid out entirely in a C array. A more general version may require the compar function to take a context, for example when operating on the content of a huge file without reading the content entirely into memory.
  • atexit(): the function argument does not take context. If the exit routine is supposed to synchronize files, databases or remote processes, all the states would have to be global.
  • signal() and sigaction(): the signal handler does not take context, therefore they cannot be modularized. There are legitimate needs for modular signals, such as alarm and broken pipe detection. These will need to seek alternative workarounds.
  • pthread_key_create(): the destructor of a thread-specific key takes the thread-specific value as the context.
  • pthread_cleanup_push(): the cleanup routine takes a context.
  • pthread_create(): the start routine takes a context.
We can see that those library calls that don't take context in the callback can restrict the utility of these library calls.

In C++, the implicit this object pointer with a type parameter can be used together to compensate for the lack of context pointer, using a function object. The type parameter would specify a class that implements operator(), the function call operator. It will be given an implicit this pointer when called. This is how std::for_each worked when C++ did not support lexical context closure. You would have to rewrite the one line of anonymous function in C++0x into several lines of function object, incurring notational overhead that does not justify the axiomatization of a simple for loop construct.

Even so, there are cases where the lack of lexical context closure in C++ can impede abstraction and code reuse.

A while ago, I implemented a Splay tree where the pointer as well as the nodes are abstracted out.
template<class Ptr /* implements pointer<? implements binary_tree_node> */>
struct splay_tree /* implements binary_search_tree */ {
  template<class Compare /* implements unary_compare<Ptr> */>
  static ptr_type&
  find(Compare& cmp, ptr_type& root) throw() {
    ...
  }

  // More functions that operate on splay trees...
}
The idea is that whoever wants to use the Splay tree algorithm will write a node class that implements the binary_tree_node interface (provides accessor and mutator to left and right subtree pointers). The template argument Ptr is the pointer type to the node. The find() function takes a comparable object, which can either be a key or another node, which will be used to find a node in the tree that compares equal to the comparable object. The template argument Compare is the type of the comparable object.

Generalizing the pointer allows an alternative memory model to be used. In my case, I used the splay tree with linear_ptr<> which performs ownership tracking, so I know the algorithm preserves resource ownership. It has been used successfully this way. Another usage of alternative memory model is where a pointer is subscript into a particular array. This allows binary tree nodes allocated from a small array (say < 256 entries) to use just one byte for the pointer.

However, it doesn't work. We could try it like this:
template<typename Index, typename Tp, Tp *array>
class small_pointer {
 public:
  small_pointer(Index i = 0) throw() : i_(i) {}
  Tp& operator*() const throw() { return array[i_]; }
  Tp* operator->() const throw() { return &array[i_]; }
 private:
  Index i_;
};
A seasoned C++ programmer will correctly spot that array—a value of type Tp *— has to be a constant. It turns out that C++'s idea of constant means that the pointer must come from the memory location of a global or static variable. For example, you can't use small_pointer in a function like this.
void foo() {
  int numbers[10];
  for (int i = 0; i < 10; i++) {
    small_pointer<char, int, numbers> p(i);  // error!
    *p = i;
  }
}
C++ compiler will tell you that numbers cannot appear in a constant expression. However, if you think about it, after the code for operator*() is inlined, the statement *p = i would become numbers[i] = i, and it makes perfect sense. However, in the case that the C++ compiler decides to place opeartor*() into its own function, it would not have access to the int array because numbers is in another lexical scope. If C++ correctly supports lexical context closure, it would actually be able to accept any identifier as a pointer to a non-type argument of a template. You would be able to instantiate the small_pointer template class with a function scope identifier as well as a class scope identifier.

Without this abstraction, I would have to re-implement the Splay tree for tree nodes that want to use a compact pointer model because the nodes are stored in an array. This need might be more common than you think, given that embedded devices like Arduino is getting more popular these days. The difference between O(n) and O(lg n) operation can mean a lot to battery life, but O(n) algorithm is often being chosen due to the prohibitive complexity involved to get O(lg n) data structure working in a compact memory layout. I certainly do not look forward to re-implementing Splay tree for each possible compact node layout. This is something that the compiler should do for me.

Tuesday, June 15, 2010

Hazard in User Interface Design

Improper user interface design could lead to hazard, but hazardous user interface design is not limited to computers; it can happen in real life. In this case, we're talking about a traffic intersection.
Here is a photograph of the intersection. It looks innocuous enough.
However, upon closer look, there are two things to notice.
First, St. Marys St. is two way up until the overpass that crosses Mass Turnpike. It is indicated by the “DO NOT ENTER” and “ONE WAY” signs at the left corner. The traffic light hanging in the middle of the intersection indicates it's permitted to make a left or right turn. However, at the right corner, we see an non-directional green light.
The green light invites the intuition that it is okay to go straight, which is incorrect and inconsistent with the sign on the other side. A naive, optimistic driver may see the illuminated green light first because it is on the same side of the road and assume that it is okay to go straight. After all, through traffic is typical for most intersections. However, this particular intersection is unusual where three of the four ways of the intersection are bi-directional, but one of the four ways is unidirectional. The green light makes the atypical seem typical, and it encourages a driver to ignore other signs of unusual road condition.

One may argue that a good driver will take all road signs into account. However, a driver under stress, for example, one who is honked by a car behind, or a driver who is distracted by cellphones, could very well pick the wrong visual cue. The inconsistent traffic light is a hazard, and a flaw in the user interface design.

A proper redesign would replace the non-directional green light to left and right arrows, like one in the middle of the intersection. It could also keep the red light lit at all times just to be abundantly clear.

Thursday, June 10, 2010

Finding n Consecutive Bits of Ones in a Bit Vector in O(lg n) Time

We want to find out whether a bit vector X of length m has n consecutive bits of ones, as well as all the positions where we can find it.

For example, the following bit vector:
(msb) 11111111 01111111 00111111 00011111 (lsb)
Has 5 consecutive ones starting at bit 0, 6 consecutive ones at bit 8, 7 consecutive ones at bit 16, and 8 consecutive ones at bit 24. In addition, we count 4 consecutive ones at bit 1, 3 consecutive ones at bit 2, and so on. The overlaps are also considered. Therefore, if we were to find all positions with 6 consecutive ones, we could generate a positional bit mask like this:
(msb) 00000111 00000011 00000001 00000000 (lsb)
The result indicates that we can find 6 consecutive ones in the original bit vector if we start at bits 8, 16, 17, 24, 25, 26.

Of course, if we can only access one bit at a time, the best running time we can do is O(mn). Most computers provide bitwise operators to compute logical AND and OR on the matching bits of two bit vectors simultaneously, as well as bit shift operators. This, as we show, will dramatically reduce our running time.

Our first attempt is to simplify the problem of finding two consecutive ones, that is, bit i should be set if both bit i and bit i + 1 are ones. This is easy. We shift X to the right by one bit, and AND it against the original unshifted X. In C, this is written as X & (X >> 1). We can generalize the first attempt to find n consecutive ones by doing X & (X >> 1) & (X >> 2) & ... & (X >> (n - 1)). This naive algorithm takes O(n) time.

The part that gets really interesting is when we observe that 4 consecutive ones can be detected by 2 consecutive ones followed by another 2 consecutive ones. In the original example,
(msb) 11111111 01111111 00111111 00011111 (lsb) [n = 1] Y0 = X
(msb) 01111111 00111111 00011111 00001111 (lsb) [n = 2] Y1 = Y0 & (Y0 >> 1)
(msb) 00011111 00001111 00000111 00000011 (lsb) [n = 4] Y2 = Y1 & (Y1 >> 2)
(msb) 00000001 00000000 00000000 00000000 (lsb) [n = 8] Y3 = Y3 & (Y2 >> 4)
The reason why this works is that in computing Y2, we use the fact that Y1 has gaps of 2 consecutive zeroes to shorten our span of ones, and in Y3 uses the fact that Y2 has gaps of 4 consecutive zeroes, and so on. This is another type of SWAR algorithm.

We will keep doing this until we reach a power of 2 that is <= n. Then the last step we just apply a shift to truncate off the remainder of the bits.

In the original example, suppose we want to find n = 7 consecutive ones.
(msb) 11111111 01111111 00111111 00011111 (lsb) [n = 1] Y0 = X
(msb) 01111111 00111111 00011111 00001111 (lsb) [n = 2] Y1 = Y0 & (Y0 >> 1)
(msb) 00011111 00001111 00000111 00000011 (lsb) [n = 4] Y2 = Y1 & (Y1 >> 2)
(msb) 00000011 00000001 00000000 00000000 (lsb) [n = 7] R = Y2 & (Y2 >> 3)
The last step R, instead of shifting by 4, we only need to shift by 3.

The final algorithm is this:
typedef unsigned int bitvec_t;

bitvec_t consecutive_mask(bitvec_t x, unsigned int n) {
  unsigned int stop = prev_pow2(n);
  for (unsigned int i = 1; i < stop; i <<= 1)
    x &= (x >> i);
  if (stop < n)
    x &= (x >> (n - stop));
  return x;
}
The for-loop runs in O(lg n) time. We still need to show that prev_pow2(), which computes the power of two that is <= n, also runs efficiently. Again, this can be done using a SWAR algorithm.
unsigned int prev_pow2(unsigned int n) {
  return (fold_bits(n) >> 1) + 1;
}
where fold_bits() is the SWAR algorithm defined as follows:
unsigned int fold_bits(unsigned int x) {
  x |= x >> 1;
  x |= x >> 2;   /* last for 4-bit */
  x |= x >> 4;   /* last for 8-bit */
  x |= x >> 8;   /* last for 16-bit */
  x |= x >> 16;  /* last for 32-bit */
  x |= x >> 32;  /* last for 64-bit */
  return x;
}
The remark "last for k-bit" indicates that if x only has k bits, then that line is the last operation for the fold_bits algorithm. Since our n only has 5 or 6 bits, we only need to do up to "last for 8-bit." It turns out that this part is also O(lg n).

Here we presented the problem of finding the positions of all consecutive bits of ones in a bit vector, and an algorithm to do it in O(lg n) + O(lg n) = O(lg n) time. An application for this algorithm is dynamic storage allocation. Suppose I have each bit in the bit vector to indicate the availability of a storage cell, then this algorithm allows me to find places where I can reserve n consecutive storage cells in O(lg n) time.

Tuesday, June 8, 2010

pthread atexit()

The atexit() function registers a function to be called when a process terminates, via exit() or via return from main(). However, when a pthread is terminated, pthread_exit() does NOT call any of the atexit() functions. There is a similar pthread_cleanup_push() which does get called, but it is expected that pthread_cleanup_push() and pthread_cleanup_pop() are a matching pair in stack order, so it only works if pthread_exit() sits right in the middle between a pair of the push() and pop().

Fortunately, when a pthread exits, it calls the destructor function specified by pthread_key_create(). Although all threads need to use the same destructor, we can write a destructor that calls another function whose pointer is specified by the thread specific data, facilitating thread local destructor.
typedef void (*exit_t)(void);

void thread_local_destructor(void *arg) {
  exit_t destructor = (exit_t) arg;
  arg();
}
It should be a doable exercise to extend this to register an arbitrary number of functions to run at pthread exit.

Bridging the high level concept with low level detail

I was fascinated with this blog post written by an undergrad, exclaiming with excitement his teaching fellow's announcement saying that "binary search tree and merge sort are broken." From the detail that he cited (posted at Google Research blog, written by Joshua Bloch, the author of Effective Java book), I think he meant binary search, and not binary search tree. The problem has to do with overflow in this line:
int mid = (low + high) / 2;
The assumption is that low, mid and high are both indices to some array, with low <= mid <= high. When you add low and high, the result could overflow and become negative, and mid becomes negative. This causes array index out of bounds. In Java, the code will crash right away due to the run-time array bounds check. In C/C++, it might access bad memory location and cause undefined behavior, including non-termination. The problem will be a different one if you change int to unsigned int, in which case mid will be smaller than both low and high. Although mid will fall inside array bounds, it's still a violation of loop invariant, which could make the code non-terminating in C (Java doesn't have unsigned int type). The fix that Joshua suggested was to write:
int mid = low + (high - low) / 2;
Another fix, if you know that you're using a binary computer (which is most computer nowadays), is this aggregate magic hack to average two integers using bitwise operators.

I'd like to point out that the algorithm is not broken. An algorithm is a high-level concept of how to do the work. The algorithm is correct, and it's provably correct. What is broken is the program, whose capability is restricted by the finite machine they're run on. However, when you translate from high-level to low-level, the loss in computing power (from infinite state to finite state) causes program error. The problem with the kind of computer science education I'm familiar with is that we often teach computer science in the high-level concept, with the (flawed) mentality that mapping it to low-level program will be straightforward. Only someone who actually writes the program will know that the devil is in the details.

Some programming languages are better at bridging the gap than others; one that came to mind is Scheme, which uses arbitrary precision integer, as well as built-in rational number. It is a very useful language for describing algorithms because many of the machine-specific aspects are hidden by the language and its run-time system. However, programs written in Scheme will run slower because of the arbitrary precision integer arithmetic. ML programs using the IntInf package also share the same benefit and shortcoming.

What I think would be very useful is to write the algorithm as a template program that can be instantiated with different integer sizes, but let the compiler deduce its limits. For example, the binary search bug could be avoided if overflow does not happen. The compiler will generate a precondition requiring that the caller of that function must check that low + high <= INT_MAX. If int is 32-bit, then INT_MAX is 2147483647; if int is 16-bit, then INT_MAX is 32767. Furthermore, since 0 <= low <= high < array_length, it suffices to show that if array_length <= INT_MAX / 2, then the function will be safe. The last step is perhaps not deducible by the compiler, but a programmer can use that fact to convince the compiler that the limits are satisfied.

You can almost write such code in the programming language ATS because the compiler statically checks for integer constraints. You will just have to use #include template technique: prepare the algorithm in a separate include file, while each instantiation will define the concrete type int and the value INT_MAX before including the algorithm. However, the language default + operator does not check for overflow, so you will have to write a new declaration of the + operator function type with the appropriate integer constraint. Maybe I'll write a tutorial in another post.

Saturday, June 5, 2010

Parallel computing and the cost of TLB shoot-down

Translation lookaside buffer (TLB) is part of the CPU's virtual memory facility to cache the mapping of virtual memory page to physical memory page, which speeds up the lookup of virtual address to physical address. Without TLB, virtual memory would require several physical memory accesses for every virtual memory access for looking up from the page table. TLB makes virtual memory practical, and it is one of the earliest form of memory caching. Virtual memory gives the illusion that the system has more available memory than actually installed. The operating system automatically arbitrates the use of main memory, which alleviates the programmer the burden of overlay planning. It is said to dramatically increase programmer productivity. The way virtual memory works is described in detail in The Locality Principle.

In the very beginning, each process running on the operating system corresponds to one virtualized computing resource. It has exactly one CPU, and it runs in an isolated address space. On a shared-memory multi-processor system, each processor will run one process. They only need to synchronize (wait for each other) when a process causes the operating system to acquire or release physical memory; but then, only the processes that acquire or release memory needs to be synchronized. All other processes can run uninterrupted and in parallel.

When the operating system introduces multi-threading support, things get complicated. Originally, multi-threading allows concurrency in an I/O bound program, which alleviates the burden of the programmer to multiplex the control flow over I/O operations that happen concurrently. However, multi-threading also allows a CPU bound process, which as multiple threads of execution, to run across several processors, all of which would share the same address space. And in this case, each processor has its own virtual memory unit, in particular the TLB, that has to be synchronized whenever there is a change to address space mapping.

The synchronization generally only needs to happen when an existing mapping is modified, or when unmapping occurs. When creating a new mapping, it is guaranteed that the TLB entry for the new mapping wouldn't exist, so no synchronization is needed. If the TLB is not synchronized, then other threads may still have access to the old mapping, which might be a physical memory page reused for other purpose. The synchronization uses a technique called TLB shoot-down.

When the mapping change happens, the processor performs the change, flags the shoot-down request at a shared memory location, and then triggers an Inter-Processor Interrupt (IPI) to other processors. When those processors receive the interrupt, they clear the TLB cache, and acknowledge that the request has been completed. The original processor has to wait until all processors acknowledge the completion.

One can imagine why TLB shoot-down is costly. The most direct consequence is that TLB has to be rebuilt, which is only cost-effective if the cost is amortized over time. Frequent TLB shoot-down will cause the virtual memory to perform poorly.

Furthermore, not only the processor that makes the change has to synchronize with the change; all the processors that share the address space are affected, regardless whether they are interested in the change or not. CPU-bound processes that want to scale typically have their threads all dealing with a private memory area in a shared address space. This is to take advantage of the CPU cache, which they use for local memory, so that bus traffic may be reduced. Bus traffic is the bottleneck of scalability in any parallel computer. However, TLB shoot-down will slow down these other threads even if they effectively use their own private memory area. The cost of TLB shoot-down increases as the number of CPUs increase. This is studied extensively and serves as a motivation of a multi-kernel called Barrelfish.

It looks like if you're looking to write a shared-memory parallel program that is CPU-bound, and want to scale to a large number of processors, you would want to reduce the number of unmap operations. Of course, the alternative is break down the threads and revert to using several address isolated processes that communicate using well-defined channels such as MPI. This will also work well on distributed systems.

Thursday, June 3, 2010

strxcpy—More Consistent, Safer String Copying and Concatenation

This is now published as a Google Project under the name strxcpy. The full motivation article that was previously found in this blog post has been relocated here.

The C language makes it easy to write programs with buffer overflow problems, but there is still something the library can do to help the programmer make fewer mistakes. For this reason, we propose strxcpy(), a variant of strcpy() and strcat(), which combines both string copying and concatenation.
char *strxcpy(char *dest, const char *dest_end, const char *src, size_t n);

The function takes two parameters to define the buffer, dest and dest_end. It guarantees that it will not overwrite the memory location past dest_end, and the string at dest is always NUL terminated. Furthermore, it does not require dest to be a NUL terminated string. The return value is a character pointer to the NUL terminator of the dest string, so that we can easily append another string to it by calling strxcpy() again.

We also propose the sprintf() variants using the same idiom.
char *vsxprintf(char *dest, const char *dest_end, const char *fmt, va_list ap);
char *sxprintf(char *dest, const char *dest_end, const char *fmt, ...);

See the motivation for buffer overflow problems.