Friday, February 26, 2010

How a generational garbage collector chokes on a high request rate

Who doesn't want their server to handle a large number of requests gracefully? So you studied the programming languages shootout and picked the one that performs best—but not C/C++, because you don't want to bother with memory management—then wrote your server application, carefully multi-threaded your code, and tuned it to your satisfaction. Only to find that the more concurrent requests you handle, the longer it takes to service each request. You profile the running time and discover that your server spends more time garbage collecting as the number of concurrent requests grows.

A generational garbage collector assumes that most objects die young, so it performs quick, minor collections on a nursery (also called the "young" or "new object" heap). The assumption is that the nursery contains few live objects, so each minor collection is quick. Survivors are moved to the tenured generation (also called the "old" heap), which is collected less frequently. When the tenured heap runs out of space, the garbage collector performs a full collection. This heuristic addresses the fact that garbage collectors spend most of their time traversing and moving reachable objects in order to get rid of the abandoned ones.

The lifetime of an object is measured in bytes allocated, not in elapsed CPU cycles. A minor collection is triggered when the nursery is full. If the nursery is too small, the probability that a short-lived object gets promoted to tenured status increases, which increases the chance that a full collection is needed later. We want the nursery to be large enough to give most short-lived objects a chance to die.

The "most objects die young" heuristic works well for single threaded programs. However, in a server application, when the server is handling multiple concurrent requests, the rate of allocation also increases proportionally—by virtue that multiple threads are allocating concurrently, making the object lifetime as seen by one thread of execution seem longer relative to the amount of objects the nursery could hold. The more concurrent requests your server handles, the more short-lived objects make it to tenure, so the tenure heap runs out more quickly. Then the garbage collector must spend more time doing a full collection.

One solution is to increase the nursery size in proportion to the number of concurrent requests. If a 4MB nursery is sufficient for one request, then use a 400MB nursery to handle 100 concurrent requests. The tenured generation should not be increased: its size should be proportional to the amount of live objects you plan to keep in memory throughout the lifetime of the server, not to the number of concurrent requests. If your garbage collector gives you the option to specify a per-thread nursery, that would be ideal.
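
For example, assuming your server happens to run on the HotSpot JVM, the young generation (nursery) size can be set independently of the total heap size; the jar name below is made up for illustration:

# give the nursery roughly 400 MB while capping the whole heap at 1 GB
java -Xmn400m -Xmx1g -jar server.jar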

Another solution is to limit the number of concurrent requests per server process. If you want more concurrency, you can always spawn more server processes. At one extreme, you can assign one worker process per request. The processes can all run on different processors, but each process is single-threaded. When a request is finished, the process makes itself available for the next request (without quitting and restarting). A middle ground is to run several processes, each handling a small number (< 20) of concurrent threads.

Friday, February 19, 2010

sbrk() is not thread safe

The sbrk() system call is a relic of Unix from before the era of virtual memory; it lets a process expand or shrink its heap memory in LIFO order. A typical programmer never calls it directly, but it is used to implement malloc() and free(). Modern operating systems provide mmap() and munmap() to allocate and free virtual memory, yet they still support sbrk(), which is now also backed by virtual memory. Many memory allocators, most notably the Doug Lea allocator, still prefer sbrk() when possible. When asked "why use sbrk() when there is mmap()?", the response is often "why not?" We shall settle the account by answering why it is bad to use sbrk().

Allocators that use sbrk() come in two kinds: those that recognize that the sbrk() heap may be non-contiguous, and those that do not. The sbrk() heap can become discontinuous when an mmap()'d region or a shared object obstructs the growth of the heap. We will not dwell on that here; instead we focus on the other reason, namely that another part of the program could use sbrk() to request more memory, bypassing malloc(). This typically happens when a performance-critical part of the program implements its own allocation strategy.

Suppose we write an allocator that, being a good citizen, tries to return memory to the operating system whenever possible. Since sbrk() can only grow or shrink memory in LIFO order, the allocator would call sbrk() only to release the last free block in the heap that it manages. Obviously, if we allow another part of the program to bypass our heap management and use sbrk() directly, we must be careful to release our free block only while it is still at the top of the heap.

The situation becomes more complicated in multi-threaded programs. A good allocator uses locks and scalable data structures to manage the heap, but it eventually has to acquire or release memory using sbrk(). We can assume that the sbrk() system call itself is atomic, but we will show that sbrk() is still not thread-safe.

Let's say sbrk(0) indicates that the top of the heap currently lies at the end of the free block we want to release, so the next step is to call sbrk() with a negative increment to shrink the heap. But before we get the chance, we are preempted by another thread that calls sbrk() to expand the heap. Once the other thread is done, we proceed to shrink the heap by the amount we previously determined. We have violated the LIFO order and shrunk away the memory the other thread just requested.
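
To make the race concrete, here is a minimal sketch of the naive release path; the function and variable names are made up for this sketch, and the comment marks where another thread's sbrk() call can slip in.

#include <stdint.h>
#include <unistd.h>

/* Naive attempt to return our top free block to the OS.
   block is the start of the free block; len is its size. */
static void release_top_block_unsafe(void *block, size_t len)
{
    if (sbrk(0) == (char *)block + len) {
        /* RACE WINDOW: another thread may call sbrk(n) right here,
           placing its newly requested memory above our block. */
        sbrk(-(intptr_t)len);  /* shrinks whatever is at the top now,
                                  possibly the other thread's memory */
    }
}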

The problem is that the atomicity of sbrk() does not guarantee the consistency of the state of the heap. There is, however, a way to augment the system call with another argument that makes it thread-safe. sbrk() has the following prototype:
void *sbrk(intptr_t increment);
We can regard sbrk() as a function that atomically adds an increment to an integer. Inspired by how compare-and-swap works, we shall modify the function as follows:
void *sbrk_safe(intptr_t increment, void *expect_top);
The new function causes the heap to grow or shrink only if the expect_top pointer, the heap top as last seen by the caller of sbrk_safe(), still matches the actual heap top. If not, an error value is returned; the caller should refresh its snapshot of the heap state and then decide whether to try again. This allows each caller of sbrk_safe() to maintain transactional integrity.
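
Here is a sketch of how an allocator might use the proposed call. Keep in mind that sbrk_safe() is hypothetical (it is the interface suggested above, not an existing system call), and the conventions that a zero increment with a NULL expect_top merely reads the break, and that a stale snapshot sets errno to EAGAIN, are assumptions of this sketch.

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

void *sbrk_safe(intptr_t increment, void *expect_top);  /* proposed above */

/* Try to release our top free block back to the OS.
   Returns 0 on success, -1 if the block must stay on our free list. */
static int release_top_block(void *block, size_t len)
{
    for (;;) {
        void *top = sbrk_safe(0, NULL);        /* read the current break */
        if (top != (char *)block + len)
            return -1;                         /* no longer the top block */
        if (sbrk_safe(-(intptr_t)len, top) != (void *)-1)
            return 0;                          /* heap shrunk atomically */
        if (errno != EAGAIN)
            return -1;                         /* unexpected failure */
        /* Stale snapshot: another thread moved the break; retry. */
    }
}

The retry loop mirrors a compare-and-swap loop: take a snapshot, attempt the update conditioned on the snapshot, and start over if the snapshot turns out to be stale.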

We showed that sbrk() in its current form is not thread-safe even if it is atomic, and we showed that it could be fixed by changing the function's interface. The ultimate verdict, however, is that it is time to stop using sbrk() altogether. We should all be using mmap() instead.

Update (Feb 22): apparently phkmalloc uses sbrk() as its primary way of requesting memory, and its successor jemalloc still allows the use of sbrk() through MALLOC_DSS option. Jemalloc tries to protect sbrk() with a mutex, acknowledging that it does not protect against third-party use of sbrk().

tcp_wrappers hostfile_match

TCP Wrapper, written by Wietse Venema of Postfix fame, is used by many service daemons on Unix to allow or deny access based on the host that initiates the connection to the service. You specify the client to be matched by inverse domain name lookup, by IP address, or by netgroup membership. It is present on many unices, such as Linux, the BSDs, and Mac OS X. Furthermore, on some systems, such as FreeBSD, NetBSD, and Red Hat Linux, tcp wrapper supports specifying a file name in place of the client match pattern list. This lets you keep the list of client matches in a separate file, avoiding clutter in the host access control list, as shown in the example below.
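
For instance, on a system whose tcp wrapper build supports the file match feature, the access control files can point at a separate list of client patterns; the file names and addresses below are made up for illustration:

# /etc/hosts.allow
sshd : /etc/sshd.friends

# /etc/hosts.deny
sshd : ALL

# /etc/sshd.friends -- zero or more client patterns, one or more per line
.friendly.example.com
192.168.1.0/255.255.255.0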

Here is a little-known secret. The official tcp_wrappers_7.6.tar.gz (the latest release, which has not been updated since 1997) does not support this feature. The feature apparently first appeared in FreeBSD. Notably, neither Mac OS X nor OpenBSD supports it. Most distributions have their own patch sets that add modern features to tcp wrapper, such as IPv6 support and various bug fixes, but these patch sets never made it upstream. IPv6 support was introduced in a fork created by Casper Dik, endorsed by Wietse but packaged separately; this fork does not have the file match feature.

To tell whether your system supports the file match feature, find the tcp wrapper source for your distribution and look for the function hostfile_match in hosts_access.c. You might find it easier to check man 5 hosts_access and see whether it mentions that "a string that begins with a '/' character is treated as a file name."

Update (9/26/2012): according to opensource.apple.com, tcp_wrappers is present up to Mac OS X 10.7.4 but appears to be missing from 10.8. Also, the proper syntax for IPv6 address/netmask is [fe80::/10], not [fe80::]/10.

Tuesday, February 9, 2010

Assembling PDF from full-page scans using LaTeX

On Friday, I requested a journal article from the library, which does not have an electronic subscription to the journal for issues prior to 1991 but has a hard copy in storage. Over the weekend, they fished it out and gave me a photocopy. Unfortunately, some text near the book binding was not legible in the photocopy because the margin was too narrow. They let me check out the book for a day to figure out what I wanted to do with it.

I wanted to have a PDF copy, so this is what I tried:
  • I tried using a flat-bed scanner with Adobe Acrobat, but the text near the binding showed up black.
  • I tried using a BookEye scanner, which is essentially a lamp and a digital camera. I could only photograph both pages in a single image, since the image-splitting function in the scanner simply chopped off the text near the binding, and the pages still showed up curved. Using a small piece of acrylic plastic, I was able to flatten one page at a time, so I photographed the two-page spread twice, once with the left page flattened and once with the right page flattened. I was able to recover the missing text close to the binding, but the PDF has duplicate pages, and half of each page should be discarded.
  • I tried using a Canon photocopier to scan for me. This produced the highest quality scan, but I was not able to flatten the page enough to recover text from the book binding.
So I took the scanned PDF from the Canon photocopier and the PDF from the BookEye, extracted the images, and used the BookEye images to patch the missing parts of the Canon scan. I also cleaned up the images. The result is several PNG files. Now I want a PDF.

It turns out it is easy with pdfLaTeX. I made a .tex file like this:

\documentclass[letterpaper]{article}
\usepackage[left=0pt,top=0pt,right=0pt,bottom=0pt]{geometry}
\usepackage{graphicx}
\begin{document}
\newpage
\includegraphics[width=\paperwidth,height=\paperheight]{000.png}
\newpage
\includegraphics[width=\paperwidth,height=\paperheight]{001.png}
\newpage
\includegraphics[width=\paperwidth,height=\paperheight]{002.png}
\end{document}

And ran pdfLaTeX on it to generate the PDF. The main idea is to set the page size to letter, set the page margins to 0, then include the image files with their width and height set to the paper dimensions using the LaTeX measurement macros. There are still rough corners to this approach, because LaTeX complains about overfull hboxes, but the resulting PDF is usable for my needs.
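
The overfull hbox warnings most likely come from the paragraph indent that LaTeX puts in front of each image, which is already as wide as the paper. A possible tweak, which I have not tested against my original files, is to add the following to the preamble:

\setlength{\parindent}{0pt} % no indent in front of the full-width images
\pagestyle{empty}           % no page number printed over the scans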