Wednesday, March 31, 2010

Parallel Computing with Memoization

About two years ago, I studied the underlying implementation of Cilk in detail. In the terminology I can remember, these are the novel features that make Cilk unique among its competitors.

  • Work stealing based on a deque, which always steals the task most likely to generate more work. The worker thread essentially does a depth-first traversal of the call graph, but lets another worker steal from a point closer to the root and continue the traversal.
  • Continuation-passing-style transformation of parallelized Cilk routines, so a function call can be suspended and put on the deque. Of course, the activation frames (forming what Cilk calls the cactus stack) now live on the heap.
  • Its use of a non-blocking data structure for the deque (what the authors call the THE protocol).

And unlike OpenMP, which parallelizes only embarrassingly parallel constructs such as for-loops, Cilk can tackle much more complicated recursive functions. Cilk was originally designed to compute chess moves, so the spawn and sync paradigm fits well with the recursive call structure of the minimax algorithm. In order to support alpha-beta pruning, Cilk also has the ability to abort.

The minimax algorithm is a dynamic programming problem: you can reduce the size of the search tree by using memoization to avoid computation that has already been done before. However, memoization has to be applied carefully. If you use a global memoization table shared by all worker threads (in shared memory), the amount of sharing causes contention on the bus due to cache coherency. The cost of sharing the result of a computation probably outweighs the cost of recomputing it.

Case in point: I tried to memoize fib(n) with Cilk. Although it ran much faster than the non-memoized version (which runs in exponential time, whereas memoization computes fib(n) in linear time), it ran slower on multiple processors than on a single processor.
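For reference, the memoized fib itself is straightforward. A minimal single-threaded sketch in plain C (the table size and names are my own, not from the Cilk experiment):

```c
#include <stdint.h>

/* Memoized fib: exponential time becomes linear. The table size is my
   choice; fib(93) is the largest Fibonacci number that fits in a
   uint64_t. A zero entry means "not computed yet", which works because
   fib(n) > 0 for all n >= 1 and fib(0) is handled by the base case. */
#define MEMO_SIZE 94
static uint64_t memo[MEMO_SIZE];

uint64_t fib(int n) {
    if (n < 2)
        return (uint64_t)n;        /* base cases: fib(0)=0, fib(1)=1 */
    if (memo[n] == 0)
        memo[n] = fib(n - 1) + fib(n - 2);
    return memo[n];
}
```

The parallel version shares `memo` among all workers, and it is exactly that sharing which generates the coherency traffic described above.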

If we were to provide memoized parallel computation, each worker thread should have its own memoization table. Unfortunately, Cilk does not expose the thread ID of the current worker thread. Perhaps using thread local storage could work, since we know that Cilk-5 uses POSIX threads. However, the Cilk paradigm is not limited to shared-memory multi-processor computers. At one point, Cilk ran on distributed systems as well. It is much better to build memoization into the language.
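A sketch of the per-worker-table idea using thread-local storage (this is my own illustration, not Cilk code; `_Thread_local` is C11, which postdates Cilk-5, and each thread pays for some recomputation in exchange for zero cross-thread cache traffic):

```c
#include <stdint.h>

/* Each thread transparently gets its own private memoization table, so
   workers never contend on shared cache lines. The trade-off is that
   results computed by one worker are invisible to the others. */
#define MEMO_SIZE 94
static _Thread_local uint64_t memo_tls[MEMO_SIZE];

uint64_t fib_tls(int n) {
    if (n < 2)
        return (uint64_t)n;
    if (memo_tls[n] == 0)
        memo_tls[n] = fib_tls(n - 1) + fib_tls(n - 2);
    return memo_tls[n];
}
```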

It would be a nice research idea, but it's only useful if people used dynamic programming algorithms more often to solve their problems.

Update (6/8/2010): I just found out about simulating coroutines in C using Duff's device (due to Tom Duff). It is exactly what the Cilk compiler tries to do to a cilkified function.
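The trick looks roughly like this (a minimal sketch in the style popularized by Simon Tatham, not the actual Cilk transformation): the switch statement jumps back into the middle of the loop, resuming the function where the previous call left off.

```c
/* A coroutine in plain C. The static `state` records where to resume;
   the case label sits *inside* the loop body, so switching on `state`
   re-enters the loop mid-flight, just like Duff's device re-enters an
   unrolled loop mid-flight. */
int counter(void) {
    static int state = 0;
    static int i;
    switch (state) {
    case 0:                 /* first call: start the loop */
        for (i = 1;; i++) {
            state = 1;      /* next call resumes at case 1 */
            return i;
    case 1:;                /* resume point, inside the loop body */
        }
    }
    return 0;               /* never reached */
}
```

Each call to counter() yields the next integer: 1, then 2, then 3, and so on.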

Update (2/9/2011): Further reading: Lock-free parallel dynamic programming provides some useful insight to this idea.

Friday, March 26, 2010

Preparing EC2 AMI for X

I'm taking a note for myself on how to prepare a Windows EC2 AMI for software X.

Prepare the installation volume.
  • Start (or use an existing) instance of Windows Server 2003 (c1.small seems to work).
  • Download the compressed installation archive to the system disk. Note: IE needs double the amount of space to download a file; it first saves a copy under Temporary Internet Files, then copies the file to the destination. Use another browser where appropriate.
  • Create an EBS volume for the unpacked installation files, attach it, and initialize the file system.
  • Unpack installation files to the EBS volume; make necessary modifications.
  • Terminate the instance, since the system disk is tainted.
  • Make a snapshot of the installation volume if desired.
Prepare the AMI.
  • Start a new instance of Windows Server 2003 using an Amazon base AMI.
  • Mount the installation volume.
  • Install the software.
  • Unmount the installation volume.
  • Do NOT defragment the system volume! See below.
  • Make necessary modifications to the system (e.g. raise the Terminal Server color depth limit to 32 bits).
  • Use the EC2 Config utility to make the password reset on next boot, then let it use Sysprep to shut down the computer for imaging.
Use the EC2 console to make the AMI. Both the installation volume and its snapshot may be discarded after the AMI is made. Leave the AMI's system disk snapshot intact.

The reason we do this in two phases is to ensure that the file system of the instance we image suffers the least amount of modification: only the changes necessary to install the software in question. Since snapshots are incremental (they can refer to the snapshot of the base AMI's system disk), this minimizes the size of the snapshot we make. If you defragment the system volume, the defragmentation could cause the volume to differ greatly from the original, and the incremental snapshot would become much larger.

Sunday, March 21, 2010

Ordered Intervals

Although we have efficient data structures for sets, such as balanced binary search trees and sparse hash tables, some sets are most efficiently expressed as intervals. An example is the character set, typically used in regular expression parsing; e.g., the set [0-9A-Fa-f] denotes the characters for hexadecimal digits. Historically, ASCII only has 128 characters, so a direct-index table would suffice; the table could be bitmapped, where each bit represents one code point, for a total of 16 bytes. The state transition table for a finite state automaton representing a regular expression would only use a couple of hundred bytes. However, supporting regular expression parsing for Unicode is more challenging. Unicode currently allocates 20 bits per character (Unicode 5.0), and a typical set could easily contain several dozen code points, so a direct-mapped table certainly will not work well (one bitmapped row in the state transition table would be 128KB). A plain old list of character intervals works much better.

For the purpose of discussion, let's forget about character sets.

Assume we have a set expressed as a collection of intervals, so we determine whether a value is a member of the set by examining whether the value lies inside any of these intervals. Set union is straightforwardly defined as the union of the collections of intervals. We'll focus on set membership for now and ignore set intersection. If the intervals are arranged in a list, searching still takes linear time. In order to allow binary search in logarithmic time, we need a way to order these intervals.

Ordering is tricky if intervals overlap. Fortunately, overlapping intervals can be trivially combined, e.g. [5, 10] ∪ [7, 12] = [5, 12]. However, there is still a corner case. How can we tell whether [5, 9] ∪ [10, 14] equals [5, 14] or not? It depends. We can combine them if the intervals range over natural numbers, but not if they range over real numbers. However, we shouldn't make any assumption about the continuity of the elements. In order to avoid corner cases like this, we make one side of each interval open. It is customary in computer science to use intervals of the form [a, b), where a is inclusive and b is exclusive, because for-loops are usually written this way.

We'll have a rule saying that, for a ≤ c, [a, b) ∪ [c, d) = [a, max(b, d)) if b ≥ c. In particular, [a, b) ∪ [b, c) = [a, c).
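Once the intervals are normalized this way (sorted, non-overlapping, half-open), membership really is a plain binary search. A sketch in C (the struct and names are illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

/* Half-open interval [start, end). The array must be sorted by start,
   with adjacent or overlapping intervals already merged by the union
   rule above, so the intervals are pairwise disjoint. */
typedef struct { int start, end; } interval_t;

bool interval_contains(const interval_t *set, size_t n, int x) {
    size_t lo = 0, hi = n;               /* search [lo, hi): O(log n) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (x < set[mid].start)
            hi = mid;                    /* x is left of this interval */
        else if (x >= set[mid].end)
            lo = mid + 1;                /* x is right of this interval */
        else
            return true;                 /* start <= x < end */
    }
    return false;
}
```

Note how the half-open convention shows up directly in the comparison: membership in [5, 9) ∪ [10, 14) correctly rejects 9 and 14 but accepts 5 and 10.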

The rest of the interval tree implementation can be found in Introduction to Algorithms, pages 311-315.

Tuesday, March 16, 2010

PDF Output with Inkscape 0.46 + Cairo 1.2.4 (RHEL)

I've been having problems producing a PDF poster using the Inkscape and Cairo that ship with RHEL 5/CentOS 5. The default "PDF via Cairo" output places some of the drawings off their designated positions when viewed on Mac OS X using Preview. The only option I found to work is to save as "PostScript via Cairo", which correctly embeds the fonts and produces output exactly as shown, but the bounding box information is lost when I run ps2pdf.

To fix the bounding box, I opened the PostScript output; the first few lines show the bounding box:

%%Creator: cairo
%%CreationDate: Tue Mar 16 17:36:36 2010
%%Pages: 1
%%BoundingBox: 0 0 2880 2160
%%DocumentData: Clean7Bit
%%LanguageLevel: 2

Then, when doing ps2pdf, I have to manually specify the device width and height using the Ghostscript options -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS, like this (the input file name here is illustrative):

ps2pdf -dDEVICEWIDTHPOINTS=2880 -dDEVICEHEIGHTPOINTS=2160 poster.ps output.pdf

This produces output.pdf with the correct paper size. By the way, the points are PostScript device units, 1/72 of an inch, so the 2880 × 2160 bounding box corresponds to the poster's 40" by 30".

Friday, March 12, 2010

Use try-finally to prevent resource leak

This is not a tutorial, but a critique of a programming language feature known as try-finally.

Mom always tells you to clean up after yourself, or to put things back where they were when you're done with them. If you don't, your room becomes a mess and your stuff gets lost. A good programmer follows a similar rule: if you acquired ownership of an object, you have to release it. A program that doesn't follow the rule leaks resources here and there, and it consumes more and more resources without being able to reuse them. Sharing might be a virtue in human etiquette, but it makes resource tracking difficult, and poor tracking eventually leads to resource leaks, both in the real world and inside the computer.

The most prominent kind of resource leak is the memory leak, and it is the kind most visible to the user, thanks to how Task Manager (or top/ps in Unix) shows the amount of memory used per program. The symptom is that the program uses more and more memory over time, until the operating system can no longer supply its insatiable gluttony, at which point the program crashes in a nasty way. (Greedy humans will eventually crash in a similar way, a forewarning to people who want to work for Google in order to indulge in its unlimited supply of gourmet food.)

But even if a programmer means in good faith to clean up after himself, some programming languages make it harder than others. For example, the following construct arises often in a non-trivial program:

thing_t *obj = acquire_thing(...);
if (...) {
  release_thing(obj);
  return ...;
} else {
  /* ... use obj ... */
  release_thing(obj);
}
The reason we could not factor release_thing(obj) out of the two branches of the if-then-else is that the then-clause has a control flow that escapes the current scope: the return statement. It could also be a continue or break statement, and our discussion would still apply. Fortunately, in this case, the code is manageable because:
  • The clean up only requires releasing one object.
  • There are only two conditional branches, then and else.
Anything more than that, the code starts to become unmanageable.

Some languages make the code easier to write by providing the try-finally construct, so we could rewrite our code like this:

thing_t *obj = acquire_thing(...);
try {
  if (...) {
    return ...;
  } else {
    /* ... use obj ... */
  }
} finally {
  release_thing(obj);
}
The finally block is executed whenever the control flow escapes the try block. This allows us to write the clean up code only once. You're not required to use exceptions in order to use try-finally.

If your language provides this syntax, the finally block is the only safe place to put any clean-up code. This is a very useful facility for combating any type of resource leak, especially memory leaks. However, looking at languages that feature exception handling, many of the languages with finally support are garbage collected, such as C#, Java, JavaScript, Python, and Ruby. In particular, C# and Java programs tend to put clean-up code in the object destructor, but garbage collection prevents you from explicitly destroying an object at a precise moment in time. The clean-up code is delayed for an unspecified duration, until a collection cycle kicks in. Rather than using try-finally, you might as well leak the garbage and wait for its collection. This renders try-finally useless for C# and Java; they had no business implementing try-finally in their languages.

The only manually memory-managed languages with try-finally support are Delphi (Object Pascal) and Objective-C. Try-finally is absent from the C language, which is in dire need of the feature.
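In practice, C programmers approximate finally with a single cleanup label that every early exit jumps to. A self-contained sketch (acquire_thing and release_thing are stand-ins for the earlier example, implemented here with malloc/free just to make the sketch runnable):

```c
#include <stdlib.h>

typedef struct { int value; } thing_t;

static thing_t *acquire_thing(int v) {
    thing_t *t = malloc(sizeof *t);
    if (t) t->value = v;
    return t;
}

static void release_thing(thing_t *t) { free(t); }

/* Returns twice the value on success, -1 on failure. */
int use_thing(int v) {
    int result = -1;
    thing_t *obj = acquire_thing(v);
    if (!obj)
        return -1;          /* nothing acquired yet; safe to return */
    if (v < 0)
        goto cleanup;       /* early exit still runs the cleanup code */
    result = obj->value * 2;
cleanup:
    release_thing(obj);     /* runs on every path, like a finally block */
    return result;
}
```

The pattern scales to multiple resources by stacking labels in reverse acquisition order, but it remains entirely convention, with nothing stopping a stray return from bypassing the label; that fragility is exactly the argument for a real try-finally in C.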

An interesting exception is C++, which implicitly calls an object's destructor when control flow leaves the scope the object is declared in, a technique known as Resource Acquisition Is Initialization (RAII). If we have a pointer to an object allocated on the heap using operator new, we can use scoped_ptr<> to delete the pointer when control leaves the scope. Apparently auto_ptr<> in the STL also works, with copy and assignment semantics that move object ownership from one pointer to another. Some people prefer to outright disallow copying and assignment, so they use scoped_ptr<> instead.

Although scoped_ptr<> and auto_ptr<> facilitate memory clean-up, they are not usable with other kinds of resources, such as file descriptors and locks, which need other types of clean-up code. Also, any C-library-managed resource, such as a graphics context, needs its own clean-up code. This is why, in C++, it is fashionable to wrap C system and library calls as C++ objects. It takes a humongous start-up effort to use C++ alongside a C library!

Finally (finally!), I conclude with two recommendations for the language design committee:
  • The C language needs try-finally, even if it does not need to support exception handling.
  • The C++ language also needs try-finally even if it does have exception handling syntax. This makes interoperability with resources managed in the C language easier.

Saturday, March 6, 2010

Understanding Spatial Locality

Last week a paper was discussed at the operating systems reading group (the website is out of date) about a garbage collector called the Mapping Collector (MC). It is a generational garbage collector that attempts to address the issue that mark-and-compact and copying collectors spend most of their overhead moving live objects in memory. The MC still copies objects from the nursery to the tenure because the assumption is that most objects die in the nursery, so copying nursery objects is cheap. It avoids copying only in the tenure: it figures out which memory pages are completely occupied by garbage, and tells the operating system to unmap those pages in order to free the memory. Of course, there is a fallback strategy: if the unmapping does not reclaim enough memory, it has to compact the tenure using the mark-and-compact Compressor collector.

The design implies that MC cannot do more copying than the pure fallback mechanism, but it could reclaim less memory, because partially occupied pages cannot be freed. The benchmarks confirm this, but the question arises whether there is a way to write a program that triggers the worst-case performance. MC relies on the assumption that garbage in the tenure clumps into clusters big enough to span memory pages. If a memory page contains even one small live object, it will not be unmapped. The idea is like the nail house phenomenon (urban planners share similar problems with dynamic storage allocation).

To support the assumption, MC cites the "widely-known phenomenon that objects with similar lifetimes tend to exhibit spatial locality in the heap," an observation advocated by P. R. Wilson et al., and notes that "in particular, we find that dead objects often occur in large clusters." The question now becomes: what makes the deaths happen in large clusters? If we willfully distributed the placement of these objects, objects that would previously have died in large clusters would now die in scattered places. A program has no control over where its objects are allocated, and it has to take into account that every object has a second "birth" when it survives the nursery and is moved to the tenure heap. A worst-case program could assume that objects are copied in depth-first search order, infer the layout of these objects in the tenure, and make them die in a way that scatters garbage in memory.

This thought exercise might seem contrived, but it does bring out a point about the transfer of objects from nursery to tenure. Depth-first search order approximates the way many programs traverse objects in memory, and object references tend to relate objects of similar usefulness together. In other words, copying objects in this order tends to create spatial locality. This property of generational garbage collection favors the Mapping Collector. In order for MC to perform well, it needs to be able to gather at least a few pages full of related live objects—assuming they would be freed together—before the nursery gets full. The size of the nursery becomes a factor in MC's performance.

This presents an interesting trade-off, namely the more work MC does in the nursery collection, the more memory it is likely to be able to free later. This is probably something the original authors of the work did not anticipate.

Wednesday, March 3, 2010

SyslogNotify: A syslog relay for desktop notification framework such as Growl

I just created a new project, SyslogNotify, on Google Code:
Syslog, from the days of BSD Unix, has been collecting log messages from processes in real time for decades; they are saved to disk by default. This tradition continues on Mac OS X. Meanwhile, desktop notification frameworks such as Growl allow application messages to be displayed in a non-intrusive way. Many applications such as Disk Utility and Installer lack native Growl support, but they emit syslog entries indicative of the progress they are making, which would be useful to display via Growl.

SyslogNotify is a daemon written in Python that understands the syslogd UDP protocol defined by RFC 3164 and forwards syslog messages to a desktop notification framework. This gives many applications, the kernel, and system services an instant presence on the desktop.
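For a feel of what the parsing involves: RFC 3164 packs the facility and severity of each message into a leading <PRI> field, where PRI = facility × 8 + severity. A minimal C sketch of that prefix parser (SyslogNotify itself is Python; this is just an illustration of the encoding):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Parse the RFC 3164 <PRI> prefix of a syslog packet. On success,
   writes the facility (0-23) and severity (0-7) and returns true.
   The valid PRI range is 0-191 (23 * 8 + 7). */
bool parse_pri(const char *msg, int *facility, int *severity) {
    if (msg[0] != '<')
        return false;
    char *end;
    long pri = strtol(msg + 1, &end, 10);
    if (end == msg + 1 || *end != '>' || pri < 0 || pri > 191)
        return false;                  /* empty, unterminated, or out of range */
    *facility = (int)(pri / 8);
    *severity = (int)(pri % 8);
    return true;
}
```

For example, a failed login attempt logged by sshd arrives with a PRI such as <34>, which decodes to facility 4 (security/authorization) and severity 2 (critical); that is exactly the kind of entry worth forwarding to Growl.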
I started working on this because my computer sometimes gets scanned by failed SSH login attempts, which do nothing but keep my computer busy. I was going to work on an automatic blocker, but decided it was a bit risky to manipulate /etc/hosts.deny programmatically from a setuid root script. For the time being, I settled on this project, which lets me monitor SSH login attempts in progress on Growl so I can manually block the offenders.