Wednesday, March 31, 2010

Parallel Computing with Memoization

About two years ago, I studied the underlying implementation of Cilk in detail, and in the terminology that I can remember, these are the novel features of Cilk that makes it unique among its competitors.

  • Work stealing based on deque, which always steals the task more likely to generate more work. The worker thread essentially does a depth-first traversal on the call graph, but let another worker steal from a point closer to the root and continue the traversal.
  • Continuation passing style transformation of parallelized cilk routine, so function call can be suspended and put on deque. Of course, the activation frame (called Cactus stack in Cilk) now lives on the heap.
  • Its usage of non-blocking data structure (which they called the THE protocol).

And unlike OpenMP, which parallelizes only embarrassingly parallel constructs such as the for-loop, Cilk can tackle much more complicated recursive functions. Cilk was originally designed to compute Chess moves, so the spawn and sync paradigm fits well with the recursive call structure of the Minimax algorithm. In order to support alpha-beta pruning, Cilk also has the ability to abort.

Minimax algorithm is a dynamic programming problem, that you can reduce the size of the search tree if you avoid computation that has already been done before, using memoization. However, memoization has to be applied carefully. If you use a global memoization table for all worker threads (in shared memory), the amount of sharing causes contention on the bus due to cache coherency. The cost of sharing the result of a computation probably out weights the cost of recomputing it.

Case in point, I tried to memoize fib(n) with Cilk. Although it did run much faster than non-memoized version (which runs in exponential time, versus memoization which computes fib(n) in linear time), it ran slower on multiple processors than on a single processor.

If we were to provide memoized parallel computation, each worker thread should have its own memoization table. Unfortunately, Cilk does not expose the thread ID of the current worker thread. Perhaps using thread local storage could work, since we know that Cilk-5 uses POSIX threads. However, the Cilk paradigm is not limited to shared-memory multi-processor computers. At one point, Cilk ran on distributed systems as well. It is much better to build memoization into the language.

It would be a nice research idea, but it's only useful if only people would use more dynamic programming algorithms to solve their problems.

Update (6/8/2010): I just found out about simulating coroutines in C, based on Tim Duff's device. It is exactly what the Cilk compiler tries to do to a cilkified function.

Update (2/9/2011): Further reading: Lock-free parallel dynamic programming provides some useful insight to this idea.
Post a Comment