Life of a Computer Scientist: May 2010

Monday, May 31, 2010

Estimation of Safe Walking Route using Google Street View

Woman Follows Google Maps “Walking” Directions, Gets Hit, Sues. Despite how she claims that her blackberry did not display the usual warning “Walking directions are in beta. Use caution – This route may be missing sidewalks or pedestrian paths,” anyone with a common sense could see from street view that there is no sidewalk, and the path might be dangerous for a pedestrian to cross.

But can Google use street view to automatically assess the walkability of a path? Maybe. At least on busy streets, you will find pedestrians. A lot of image recognition research has gone into detecting people in an image, so if you detect people in a street view, it is likely a walkable street. Even if there is no people, it might be possible to detect common pavement painting pattern such as pedestrian cross-walk, or pedestrian walk lights. A street with those computer detectable elements is very likely to be walk-safe.

GitFarm source code?

It is a popular request to GitFarm for its source code, but these requests are mostly ignored by its provider, Bo (yboycn). GitFarm allows you host a Git repository on Google AppEngine. It turns out that if you know a bit of Java, you can put together something like GitFarm over the weekend. Most of the pieces you need are already available as open source.

JGit, Git implemented in Java (EDL licensed).
GaeVFS, a filesystem for Google AppEngine using datastore and memcache API (Apache licensed).

And if you build your Git hosting on AppEngine that way, you are not compelled to publish your source code. Both EDL and Apache License are only triggered by distributing source or object code, not software as a service. You are not even compelled to reproduce a copyright notice for providing software as a service.

Friday, May 28, 2010

Tracking Object Ownership in C++

Ever since the beginning when people program in C/C++, the need to track object ownership is recognized, but there has been a lot of confusions. We need ownership tracking in order to know who is responsible for freeing an object. This is essential for preventing memory leaks, since C++ does not typically use garbage collection (though it can use something like Boehm GC). Ownership tracking is also essential for preventing double-free, which can corrupt the memory allocator's internal data structure.

Ownership tracking does not sufficiently prevent all memory related error. For example, another part of the program could retain an aliased const reference to an object and use it after the true owner frees it. After the memory occupied by the object is returned to the allocator, it could be subsequently used for other purpose, which could be corrupted when someone modifies the aliased original object that no longer exists.

We will need to use the concept of linearity if we also want to prevent aliasing, but a lot of seasoned C++ programmers don't understand this. It is evident, for example, from their disdain about std::auto_ptr<>: when an STL algorithm such as sort() is broken when you use auto_ptr, they blame auto_ptr for the culprit. However, I would argue that the culprit is sort(), which creates an alias for an item in the container and abandons it without proof that it is safe to do so (perhaps because by default, C++ treats everything as plain old data, which means copying and assignment has no side-effect). If an algorithm observes object linearity, then it will work with auto_ptr<>. There has been significant research in the programming language ATS which strictly enforces linearity in its type system, and there is a lot of code implementing common algorithms that satisfy linearity.

While it is great if you can use a new language with built-in linearity checking, what if you're stuck with C++ (e.g. I miss template meta-programming and curiously recurring template pattern)?

The real problem I have with auto_ptr<> is that it automatically frees the object when the auto_ptr<> object is destructed. This means that non-linear code can run for a while, erroneously freeing objects until everything mysteriously disappears, and you will be puzzled by what is going on. Distinguishing move semantic and copy semantic is a start: it alleviates the need of the auto_ptr_ref hack (otherwise auto_ptr<> cannot be returned from functions due to the non-const copying constructor). But move semantics doesn't enforce linearity; items can still mysteriously disappear after they're moved out of the container but not moved back in.

C++ type system does not have the power to enforce linearity during compile time, but the next best thing to do is to make programs crash early when non-linearity is detected. I've been using a home-brew linear_ptr<> class which the only difference from auto_ptr<> is that, whenever there is a delete p, I remove it and replace it with assert(p == NULL). Regardless of move or copy semantics, a linear pointer only tracks the ownership but never frees an object. I've been using this ubiquitously in an implementation of a memory allocator with great success.

In using linear_ptr<>, I came to realize that a lot of times, a function or method doesn't intend to take ownership of the object passed in as an argument, nor does it alias the object pointer, but we want to have call-by-reference argument semantics. In these cases, I use the idiom const linear_ptr<T>& which means that the pointer itself should not be modified (still points to the same object, not taking ownership) but the object itself is non-const, so it could be mutated.

It turns out that in a regular C++ program, you don't need to use linear pointer to express the intention that a function does not intend to take ownership of an object. Just use a reference! The key idea is that delete takes a pointer, not a reference. Since a reference cannot be deleted, it signifies the fact that the caller does not pass ownership of the object to the function. Furthermore, it signifies that the reference cannot be NULL, so the object has to exist prior to the call. The caller loans the object to the function; the function borrows the object from the caller. Of course, the function should still refrain from converting the reference to a pointer or making an alias of the reference, but it is much easier for a programmer to self-discipline in a local context of a function.

The role of references in C++ is often under-appreciated. For example, Google C++ Style Guide imposes a rule on references that prevents you from writing a function that wishes to mutate an object but will not claim ownership.

However, it is important to note that, unlike pointers, C++ references cannot be reassigned. Assignment on a reference will be applied on the object being referenced to, not the reference itself. This means that if you want to repeatedly borrow objects in a recursively defined data structure (e.g. singly linked list, binary search tree), you will have to write recursive functions rather than doing it in a loop.

In the case where you really want to write a loop that repeatedly borrows objects from a data structure, you can use a loan pointer, where the object pointer is cast to a pointer of a loan class of that object. A loan class is a mixin that merely disables the constructor and destructor by making them private.

template<class superTp>
class loan : public superTp {
 private:
  // Disallow constructor and destructor.
  loan();
  ~loan();
};

It most certainly should disallow copy constructor as well, but this is left for an exercise to the reader. When we want to signify the intention to retain ownership, we will make a loan pointer using the following function:

template<class superTp>
loan<superTp> *make_loan(superTp *objp) {
  return static_cast<loan<superTp>*>(objp);
}

A similar function can be written for linear_ptr<> as well, to convert from a const linear_ptr<T>& to loan<T> *.

A loan pointer can be reassigned to reference a different object, but the usual caution about aliasing applies. A function should not retain the alias beyond the scope of the function. I rarely use the loan<T> * idiom myself because C++ reference works most of the times for me to write a function with the intent not to claim object ownership from the owner.

In conclusion, yes, there is a difference between pointer and reference in C++, and the difference is often under-appreciated. You may use pointer argument to signal the intent of the function to take ownership from the caller; but when you use a reference argument, the caller keeps ownership, and the function merely borrows the object for use in the function scope. This takes care of ownership dispute in function call. However, outside of a function call, auto_ptr<> is not enough to enforce linearity in code. Since C++ type system does not statically enforce linearity, the next best alternative is to use linear_ptr<> that detects non-linearity and crashes the program in run-time the earliest possible. These facilities can be used together to help programmers write linear code that is free of memory leak and memory corruption problems due to double-free and free-after-use.

Wednesday, May 26, 2010

Passing File Descriptor over Unix Domain Sockets

This is a follow-up to my rant in Secrets (everyone should know) About Running Web Servers. I made a few recommendations about structuring a server written in C/C++ and in Java. Namely, the server will consist of several worker processes, and a monitor process that dispatches the work. The overall design is the same for C/C++ and Java, but the motivation and rationale are different.

C/C++: since memory leak for programs written in this language is unavoidable, there should be a way for a worker process to restart periodically (after a certain number of requests). We need to be able to "switch on" and "switch off" any worker process in a pool of processes without exposing downtime.
Java: multi-threading interacts poorly with generational garbage collection, so if we don't want to resolve to fudging nursery size, worker process should be single threaded. However, we will be running several concurrent worker processes so we can take advantage of the multi-processor hardware.

Both of these require having a lightweight "monitor" process that listens for network sockets, passes an incoming connection to one of those workers, and monitor the health of these workers. We don't want to fork on socket accept() because some application's startup cost could be quite high (e.g. JVM).

It might be tempting to use FastCGI, but it turns out that the FastCGI protocol requires copying from worker to the monitor, which copies the response back to the client. We want to avoid any redundant copying at all cost. This means once the monitor accepts a listening socket, it will pass the socket directly to a worker, and leave it up to the worker to communicate with the client.

It turns out that folks who designed BSD Unix had already foreseen this scenario. Nowadays, it is possible to do so at least on AIX and on Linux. The basic idea is to establish a Unix Domain socket between monitor and workers. The monitor sends opened socket "credentials" to the worker using a special flag to sendmsg(). The worker receives the socket and works with it as usual.

Since this technique is platform specific (a quick glance on Windows Socket API function list does not reveal any facilities for passing open sockets between processes), we can safely bet that Java does not support it in the Java API. Unix Domain Socket implementations in JNI exist; unfortunately, neither juds nor junixsocket support passing file descriptors. This actually makes a nice business opportunity, selling Java monitor-worker based server framework.

The same can be said about C/C++. Although I'm pretty sure someone has already rolled their own monitor-worker implementation that does what I described here, the effort involved is non-trivial, so I'm willing to bet that someone is going to be willing to pay for such framework.

Monday, May 24, 2010

Mac OS X thread local storage

Historically, Mac OS X uses the Mach-O executable format, which does not support thread local storage data section found on most operating systems using ELF executable format (here is how ELF handles thread local storage as documented by Ulrich Drepper). If you try to compile a program using __thread modifier, you'd get the following error.

$ cat tls.c 
static int __thread x;
int main() { return 0; }
$ gcc tls.c 
tls.c:1: error: thread-local storage not supported for this target

However, it has been suggested that pthread_getspecific() has a fast implementation on Mac OS X. The pthreads thread specific functions are POSIX API facility for thread local storage (TLS).

On Mac OS X, pthread_getspecific() for i386 takes exactly three instructions. It is similar to ELF handling of TLS on i386, in the sense that both use the GS segment to point to a thread control block. However, the difference is that ELF only stores the thread pointer in the thread control block, which is then used to calculate the address of the thread local storage data section for that thread. On Mac OS X, pthread specific values are allocated from the thread control block in the GS segment. This saves another indirect reference, so that could be faster.

The way Mac OS X implements pthread_getspecific() for ppc target is more complicated. First, it looks up from the COMMPAGE (defined in cpu_capabilities.h) the address to several possible pthread_getspecific() functions implemented by the mach kernel, and branch there. From there, thread control block is either obtained from SPRG3 (special purpose register #3), or by an ultra-fast trap. The SPRG3 points to thread control block where pthread specifics are allocated from, so the approach is comparable to using GS segment on i386. The ultra-fast trap approach, despite its name, is a slower fall-back approach that appeals to the task scheduler to find out which thread is currently running.

There is no reason why pthread_getspecific() couldn't be implemented as efficiently as compiler's thread local storage variables. However, with __thread, the compiler could cache the address for the thread local variable instance and reuse it in the current function. If several functions are inlined, all of them could use the same pre-computed address. With pthread_getspecific(), the compiler doesn't know that the address to the pthread specific variable can be reused. It has to call the function again whenever the value is asked. For this reason, Mac OS X also provides inlined version of pthread_getspecific() to get rid of the function call overhead, as well as allowing the compiler to pre-compute the offset into the thread control block whenever possible.

I don't know how other operating systems implement pthread_getspecific(), but if they already support thread local storage variables, they should be able to implement pthread specific like this to achieve the efficiency as expected on Mac OS X:

// Assume we only need a finite number of slots.
#define MAX_SLOTS 512

// For storing thread specific values, using thread local storage.
__thread static void *specifics[MAX_SLOTS];

// For storing allocation status for a pthread-specific key.
static struct {
  int created;
  void (*destructor)(void *);
} key_descriptors[MAX_SLOTS];

// For locking the routine that has to manipulate key_descriptors,
// e.g. pthread_key_create() and pthread_key_delete().
pthread_mutex_t key_descriptor_lock;

This is inspired by pthread_tsd.c in Mac OS X Libc. Notice that, unlike their Libc, we can implement pthread_getspecific() and pthread_setspecific() using trivial inline functions or macros that access the specifics array.

Therefore, my recommendation is to use POSIX pthread specific in your program, which will work on any POSIX including Mac OS X. If your operating system has a slow POSIX pthread specific implementation, but the compiler has thread local storage support, then you can roll your fast pthread specific like shown above.

Thursday, May 20, 2010

Ambient lighting

Put these LED lighting strips in a clear water hose for ambient lighting project?

Secrets (everyone should know) About Running Web Servers

Datacenters are expanding, and a lot of them host web servers. It is only imperative that web servers run fast and lean. An ineffective web server implementation can cost you money in many ways.

You will need beefier hardware; they are more expensive, draw more power, and generate a lot of heat (requires high cooling cost).
You will need more servers. In addition to all the above disadvantages, they take more room, which costs you more in rent.

Although network bandwidth charges can be expensive, they scale nicely with the amount of traffic you receive. There are still things you can do to halve the bandwidth use, like compressing HTTP content. Other things like combining packet output can shave a few bytes from packet headers, and it improves latency, but these are not the main focus for this article. I want to talk about is, what can make you run a web server that is unnecessarily costly, purely due to the software that runs on the server.

First of all, PHP is expensive to run (i.e. Mark Zuckerberg is an idiot). In the very beginning, there was a web server that can serve static files. Then there was a web server that can serve dynamic content by spawning child processes and delegating requests to them using the CGI protocol. You can write CGI in any language, including C, C++, Perl, and even shell script. This satisfies applications where Server Side Include does not suffice. But for certain applications, the ability to embed code directly inside HTML code makes the program look more elegant because it saves you all the print statements. This is where PHP came along. It was first invoked by Apache over CGI. Each time you make a request to a PHP page, the PHP interpreter has to parse and compile your PHP code from head to toe into bytecode before it can run.

As PHP gains more popularity because of its ease of rapid prototyping, the application size increases. It became impractical to do head to toe parsing each time someone requests a PHP page, which nowadays can easily involve tens of megabytes worth of code. MediaWiki is 44MB. Joomla is 33MB. Wordpress is 9MB. PhpWiki is 1.4MB. This is where mod_php comes along. It avoids having to recompile a massive amount of code by doing it only once, and caching the bytecode in memory. Putting all the bytecode in memory seems to be a small deal, but Apache distributes its workload over several child processes, each of them has an instance of mod_php that caches its own copy of the bytecode. If you increase the number of child processes, you run out of memory. If you limit memory usage, you cannot fully utilize CPU and I/O. This is explained better in Why mod_php is bad for performance. Unfortunately, memory is still the scarce resource. There is a limited amount of memory you can put into each server. You will have to run more servers; they take up more space and cost more money.

Facebook recently developed a version of PHP called HipHop that precompiles your code. This will alleviate the need for mod_php hack, which trades memory usage for performance, a poor tradeoff (maybe Mark Zuckerberg isn't as dumb as I thought).

Now, back to CGI. I did mention that you could write CGI in any language you want. However, some languages have a pretty bad run-time startup cost (Java is particularly notorious due to its Just In Time compilation). A solution is to keep those CGI child processes running, hence the FastCGI protocol is born. This means that your application runs more or less persistently along with the web server. Some languages like Java or Python simply has a web server written in that language, which requires no bridging. FastCGI is just a clumsy way to bridge together a particular server software with dynamic site logic written in a different language, due to administrative reason, e.g. at a web hosting provider. If you can run your own web server, don't bother with FastCGI. For the rest of the article, assume you write a persistent web server from scratch.

The fact that your web application runs persistently on the server has several implications. If you write your application in C or C++, and it has a memory leak, it will keep accumulating memory usage until it runs out of memory and dies. You can limit the effect of memory leak by restarting the server process after a predetermined number of requests, and keep several processes active so that some servers are available to handle requests at any given time.

If you use a garbage collected language like C# or Java, you don't have to worry about memory leak, but multi-threaded request handling is known to interfere with generational garbage collection by prematurely putting would-be garbage in the tenure heap, which causes costly full-collection to run more frequently. There are two solutions, either to increase nursery size, or separate the application into several processes, each running only a few threads. You can mix the two approaches.

This pretty much covers all I want to say. Here is a summary of secrets that I want everyone to know, about running web servers.

Stay away from mod_php (or mod_perl, mod_python, mod_ruby).
Stay away from web hosting providers that insist everyone should use mod_php (or other mod_*s) because they are encouraging their customers to be more expensive, and they are often more pricey. That applies to 99.9% of web hosting companies out there.
If you run your own web server, don't bother with FastCGI. If you can help it, use a web server that can be embedded inside your web application, written in your application's native language.
It still pays to write a web application in C/C++, especially where scalability is concerned.

Oh, and I forgot. If you can use client-side scripting, use it. Each individual user's computer might compare feebly to your server, but collectively their computing power is more than you can rent from a data center.

Tuesday, May 18, 2010

SSH from a jail

In FreeBSD, when you chroot into a jail, you cannot access /dev/tty anymore because the current terminal is not created inside the jail. You'd see something like this:

$ cat /dev/tty
cat: /dev/tty: Device busy

However, that also means you cannot ssh out of the jail because ssh opens the terminal device when asking for password. One way to work around this is to use public key authentication. However, there is a simpler alternative, using an SSH_ASKPASS hack.

The manual page of ssh states that if both DISPLAY and SSH_ASKPASS are set, then the program specified by SSH_ASKPASS will run. The password is printed to standard output of the program. Instead of prompting for a password, we will write a script to echo the password out.

$ cat > ssh-login.sh
#!/bin/bash
echo open-sesame
^D
$ chmod +x ssh-login.sh
$ export DISPLAY=:0.0
$ SSH_ASKPASS=./ssh-login.sh ssh ...

Obviously, SSH will still not be able to access the terminal, so any SSH functionality that involves a pseudo-terminal will not work. You will be able to access non-interactive commands, SFTP, and port forwarding this way.

Wednesday, May 12, 2010

Abbreviating output to a line in the terminal

A very long time ago, I quickly whipped together a Perl script that will reduce the standard output from another program to just a line, prefixed with an animation character, and optionally suffixed with ellipses if the line is too long. It is useful if only the most recent output is significant, and that I only want to know if the program is making progress. It can be applied to many commands, e.g.

tar zxvf source-1.1.0.tar.gz | abbrev
unzip archive.zip | abbrev
make | abbrev
tail -F /var/log/messages | abbrev

I've long wanted to rewrite that Perl script in C, and now here it is (abbrev.c). The program is licensed under GNU General Public License, version 3.

Compiling is simple if you're using GNU make:

make abbrev

You don't need to supply a Makefile. The implicit rule will work. You can optionally supply CFLAGS=-O3 to the make command, as well as optionally strip the executable after make. To install, just move the executable to somewhere in your PATH. It should compile out of the box on Linux and Mac OS X.

There are some subtle features that one might not notice:

Output is throttled to a certain frame rate (default 30 fps).
Resizing the terminal window will cause the line length to adjust accordingly.

Hope you enjoy it.

Tuesday, May 11, 2010

Cost estimate for large project

A programming project of mine is growing in the number of lines of code. Although I write documentation as inline comments, I'm at a point where it is nice to have a reference. I've always known that open source projects tend to use Doxygen, which automatically parses your code and extracts documentation from it to be used as a reference. But this is not what I'm going to talk about today, though I'm using Doxygen as an example.

As I read about Doxygen, I noticed that the author offers some form of commercial support: implementing features, fixing bugs, and providing help to answer questions. I think it would be more useful if he could offer the service to retrofit a Doxygen configuration file for an existing project. In doing this, he would learn about large project requirements that he may build into Doxygen in the future. Another reason is that some people might not want to divert their attention to figure out how to configure Doxygen for their project, and they're willing to pay for the service (this is another plausible open source business model).

Most projects have a particular convention, so that setting up Doxygen is only a one-time effort. The nice thing about a computer is that, once the initial setup is done, the cost of recreating the reference from source code is miniscule, and it is a cost that most project teams should be able to absorb. The expensive part is the setup, which requires human attention. This applies to any automation for a project, so I'm going to stop talking about Doxygen from now on.

Not all projects are created equal. It is simple to setup automation for a project that spans only a few thousand lines of code. A large project with millions lines of code has a lot of hidden complexity. One source of the complexity is due to the fact that collaborative effort often has bits of inconsistencies because everyone does things somewhat differently. Even the same person changes the way he/she does things over time. In big-O notation, the amount of inconsistency is O(n) where n is the number of lines of code. In order to resolve these inconsistencies, we need to compare each k inconsistencies to the remaining (k - 1) inconsistencies (so we can devise a workaround or a fix), and the amount of effort required would be O(n²). This is the cost of setting up automation for a project.

It's easy to see that it is the most cost effective to setup automation from when the project is small. In addition, for a large project, the cost estimate would be proportional to the square of the size of the project. A well-maintained project is tree-structured, and (without proof) a divide and conquer strategy could be used to resolve inconsistencies that could hinder automation. The cost would be closer to O(n logn).

I think this very simple asymptotic analysis explains why money spent on multi-million dollar projects tend to vanish. A project involving n workers cannot allow each one of them to collaborate with one another directly, which means the communication complexity alone is O(n²). A ideal, hierarchical, tree-structured organization still requires O(n logn) amount of communication complexity for the time spent on meetings and planning. The cost for the actual work done is still O(n), and the total cost estimate is O(n logn) + O(n) = O(n logn). The cost estimate for a large project never scales linearly, and most directors under-estimate the cost if they use a linear model.

Friday, May 7, 2010

Multi-threaded allocator locking

Suppose we have an allocator that employs a headerless object layout strategy, like Streamflow. The free(p) operation needs a way to tell if p is managed by any of the subheaps, and if so, which subheap manages p. In essence,

void free(void *p) {
  SubHeap *h = GetSubHeap(p);
  if (h)
    h->Free(p);
}

We need a way to quickly lookup the subheap that manages a pointer. The mechanism used in Streamflow is to have a byte vector, each one corresponding to a page in memory denoting whether the subheap is present, and if so, which kind (I imagine subheaps would be in power of two number of pages, naturally aligned to its size). This mechanism is attributed to the Big Bag of Pages (BIBOP) due to Guy Steele, Jr.

On a 32-bit machine with 4KB page size, the byte vector would be 1MB. On a 64-bit machine, it will be a whopping 4096TB unless we use much bigger page size, or that the table would have to be sparse, such as using a multi-level hash table, or using some kind of a binary search tree. Both impose challenge in scalability in a multi-threaded setting because concurrent lookup and update are non-trivial.

Inspired by how Translate and Lookaside Buffer (TLB) works, I postulates that most of the lookup could be localized to the thread's own lookaside table. It is a linearly probed hash table, keyed by the lookup address, with a small probe distance (say, 4). The idea is that if we can't find the entry in four iterations of a search, then we might as well appeal to the global data structure. Hash table insertion may overwrite existing data. The policy to balance probe distance, eviction policy, and performance would be subject to further research.

So we will have GetSubHeap(p) implemented like this:

SubHeap *GetSubHeap(void *p) {
  SubHeap *h;
  if (h = LocalSubHeapLookup(p))
    return h;
  if (h = GlobalSubHeapLookup(p)) {
    LocalSubHeapInsert(p, h);
    return h;
  }
  return NULL;
}

The LocalSubHeapLookup() and LocalSubHeapInsert() operate on thread local data structure, so they do not need any synchronization. GlobalSubHeapLookup(), on the other hand, must acquire a read lock.

What might be of special interest is GlobalSubHeapRemove(). What if a cached entry of subheap is being removed? Ideally, the removal would only happen if the subheap is empty, which means the program should no longer have references to any objects in the subheap. The entry in the lookaside table would be left unused until it is evicted. However, many programs have double-free problems. This would cause free(p') to find a subheap from p' that is already removed, and call SubHeap::Free(p'). Therefore, the precise problem depends on what the function does. Another problem is if the subheap is removed, but another subheap is constructed on that address. The double-free problem would corrupt the subheap, again, depending on how resilient SubHeap::Free(p') handles corrupted pointers (the subheap might have a different object layout this time). Even if we keep the thread-local lookaside table up to date with the global mapping, there may simply be no good way to handle double-free gracefully.