Thursday, January 8, 2009

Uses of hugetlb

While the kernel hackers are busy working out what the reasonable performance expectations of hugetlb in the Linux kernel should be, which lets applications use large memory pages (>>4KB, typically ~4MB) offered by Page Size Extension, there are a few consequences of using large pages that restrict where they are applicable. Using a large page improves performance by significantly reducing TLB misses, but the sheer size and alignment requirement of the page is a cause for concern. Assuming a large page is 4MB in size, it has to be aligned to a 4MB boundary in physical memory and has to be physically contiguous, so an OS mixing 4KB and 4MB pages might run out of contiguous physical memory due to fragmentation caused by the 4KB pages.

An application could use large pages as a general-purpose heap, but it should avoid fork(). There are currently two ways to allocate large pages: using mmap(2) to map a file opened on hugetlbfs (Linux only), or using shmget(2) with a special flag (SHM_HUGETLB on Linux, SHM_LGPAGE on AIX, noting that on AIX a large page is 16MB); see the sketch after the list below.
  • With shmget(2), the shared memory backed by large pages cannot be made copy-on-write, so after fork() the parent and child processes share the same heap. This can cause unexpected race conditions. Furthermore, either process could create a new heap space that is private to it, and references to that new heap would cause memory errors in the other process. References to memory newly allocated from another private heap, such as by malloc(3), would likewise be invalid in the other process.
  • While mmap(2) allows you to set MAP_PRIVATE to trigger copy-on-write, the copying is going to be expensive for two reasons. The most obvious one is the cost of copying 4MB of data even if the child only modifies one word and then proceeds with an exec(2). With 4KB memory pages, copying a page on demand is much less expensive and can easily be amortized across memory accesses over the process's run time. The other reason is that Linux might run out of hugetlb pool space and have to assemble physically contiguous blocks of memory to back the new large-page mapping; this is due to the way Page Size Extension works. Furthermore, the OS might also require the swap space backing a large page to be contiguous, needing to move blocks around on disk and causing a lot of thrashing. This is much more expensive than ordinary copy-on-write.
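
As a rough illustration of the two allocation paths, here is a minimal sketch in C. It assumes hugetlbfs is mounted at /mnt/huge (the mount point and file name are made up for the example), that a huge page is 4MB, and that the huge page pool (/proc/sys/vm/nr_hugepages) and the necessary permissions have already been set up; error handling is kept to a bare minimum.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #ifndef SHM_HUGETLB
    #define SHM_HUGETLB 04000          /* Linux flag value, in case the libc header lacks it */
    #endif

    #define HUGE_PAGE_SIZE (4UL * 1024 * 1024)   /* assumed 4MB huge page */

    int main(void)
    {
        /* Path 1: mmap(2) a file created on a hugetlbfs mount. */
        int fd = open("/mnt/huge/heap", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }
        void *p = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Path 2: shmget(2) with SHM_HUGETLB, then attach with shmat(2). */
        int shmid = shmget(IPC_PRIVATE, HUGE_PAGE_SIZE,
                           SHM_HUGETLB | IPC_CREAT | 0600);
        if (shmid < 0) { perror("shmget"); return 1; }
        void *q = shmat(shmid, NULL, 0);
        if (q == (void *) -1) { perror("shmat"); return 1; }

        /* ... use p and q as large-page-backed memory ... */

        shmdt(q);
        shmctl(shmid, IPC_RMID, NULL);
        munmap(p, HUGE_PAGE_SIZE);
        close(fd);
        unlink("/mnt/huge/heap");
        return 0;
    }

Note that the mapping length has to be a multiple of the huge page size, which is why the sketch hard-codes 4MB.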
Large pages could also be used as a read-only memory store for caching large amounts of static data (e.g. for web servers), in which case the mapping never needs to be made copy-on-write. The store could also be made read-write but shared, provided memory access is synchronized across processes, for example using SVr4 semget(2) or lock-free, wait-free data structures. This is appropriate for database servers.
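
A sketch of the read-only cache case, again assuming a 4MB huge page and a made-up file /mnt/huge/cache on a hugetlbfs mount that some loader process has already populated: each server process simply maps the file shared and read-only, so all of them share the same physical large pages and copy-on-write never comes into play.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define HUGE_PAGE_SIZE (4UL * 1024 * 1024)   /* assumed 4MB huge page */

    int main(void)
    {
        int fd = open("/mnt/huge/cache", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        const char *cache = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ,
                                 MAP_SHARED, fd, 0);
        if (cache == MAP_FAILED) { perror("mmap"); return 1; }

        /* Serve requests out of the shared, read-only cache... */
        printf("first byte of cache: %d\n", cache[0]);

        munmap((void *) cache, HUGE_PAGE_SIZE);
        close(fd);
        return 0;
    }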
