Life of a Computer Scientist: Bridging the high level concept with low level detail

I was fascinated with this blog post written by an undergrad, exclaiming with excitement his teaching fellow's announcement saying that "binary search tree and merge sort are broken." From the detail that he cited (posted at Google Research blog, written by Joshua Bloch, the author of Effective Java book), I think he meant binary search, and not binary search tree. The problem has to do with overflow in this line:

int mid = (low + high) / 2;

The assumption is that low, mid and high are both indices to some array, with low <= mid <= high. When you add low and high, the result could overflow and become negative, and mid becomes negative. This causes array index out of bounds. In Java, the code will crash right away due to the run-time array bounds check. In C/C++, it might access bad memory location and cause undefined behavior, including non-termination. The problem will be a different one if you change int to unsigned int, in which case mid will be smaller than both low and high. Although mid will fall inside array bounds, it's still a violation of loop invariant, which could make the code non-terminating in C (Java doesn't have unsigned int type). The fix that Joshua suggested was to write:

int mid = low + (high - low) / 2;

Another fix, if you know that you're using a binary computer (which is most computer nowadays), is this aggregate magic hack to average two integers using bitwise operators.

I'd like to point out that the algorithm is not broken. An algorithm is a high-level concept of how to do the work. The algorithm is correct, and it's provably correct. What is broken is the program, whose capability is restricted by the finite machine they're run on. However, when you translate from high-level to low-level, the loss in computing power (from infinite state to finite state) causes program error. The problem with the kind of computer science education I'm familiar with is that we often teach computer science in the high-level concept, with the (flawed) mentality that mapping it to low-level program will be straightforward. Only someone who actually writes the program will know that the devil is in the details.

Some programming languages are better at bridging the gap than others; one that came to mind is Scheme, which uses arbitrary precision integer, as well as built-in rational number. It is a very useful language for describing algorithms because many of the machine-specific aspects are hidden by the language and its run-time system. However, programs written in Scheme will run slower because of the arbitrary precision integer arithmetic. ML programs using the IntInf package also share the same benefit and shortcoming.

What I think would be very useful is to write the algorithm as a template program that can be instantiated with different integer sizes, but let the compiler deduce its limits. For example, the binary search bug could be avoided if overflow does not happen. The compiler will generate a precondition requiring that the caller of that function must check that low + high <= INT_MAX. If int is 32-bit, then INT_MAX is 2147483647; if int is 16-bit, then INT_MAX is 32767. Furthermore, since 0 <= low <= high < array_length, it suffices to show that if array_length <= INT_MAX / 2, then the function will be safe. The last step is perhaps not deducible by the compiler, but a programmer can use that fact to convince the compiler that the limits are satisfied.

You can almost write such code in the programming language ATS because the compiler statically checks for integer constraints. You will just have to use #include template technique: prepare the algorithm in a separate include file, while each instantiation will define the concrete type int and the value INT_MAX before including the algorithm. However, the language default + operator does not check for overflow, so you will have to write a new declaration of the + operator function type with the appropriate integer constraint. Maybe I'll write a tutorial in another post.

Life of a Computer Scientist

Tuesday, June 8, 2010

Bridging the high level concept with low level detail

No comments: