Tuesday, January 19, 2010

Firefox Memory Usage

It's been a while since pavlov last visited optimizing memory usage in Firefox, which resulted in a switch to jemalloc. Since then, I decided to take a closer look at memory allocation. Using dtrace to attach to an existing running Firefox process, I traced the arguments for each calls of malloc() and calloc(). The trace file was observed over a day of Firefox usage, resulting in a 1.7GB trace file. The histogram derived from the trace looks like this:


The x axis is the object size, and the y axis is the total number of requests of that size. This is admittedly not very interesting. All it says is that there is a lot of small-sized allocation, but the different sizes are pretty spread out. There is occasionally very large requests.

I decided to aggregate object sizes into size classes in the same way jemalloc and dlmalloc does it: there is one class for each sizes <= 512 bytes in 8 byte increments, and sizes above that are grouped to their power of 2 ceilings. There is a bin for each 0, 8, 16, 24, 32, ..., 504, 512 size classes, and 513--1024, 1025--2048, 2049--4096, ... size classes. The resulting aggregated histogram, plotted with x axis on the log scale, looks like this.


It appears that both allocators would probably perform well for most sizes, since the sizes 512 bytes and under are managed using lists with O(1) insertion and removal. However, the 1024--2048 class may be a bit of a concern. Here is the linear scaled histogram zoomed in to this range:


Although it can't be seen easily from the figure, there are two significant peaks at exact sizes 1040 and 1047 bytes.

Given this information, is jemalloc a good choice? Probably not. That's because jemalloc rounds object sizes up to a power of 2 number for objects between 512 bytes and 1MB, so objects that are 1040 and 1047 bytes will have 2048 bytes allocated for them. That is a large amount of internal fragmentation.

The Doug Lea allocator dlmalloc, on the other hand, does not round sizes up like that. The power of 2 binning for large objects is only for the purpose of locating the object size class. Each size class is a doubly linked list of objects of the same size, with O(1) insertion and removal.

However, this seems to contradict another finding by pavlov, namely that jemalloc gives better fragmentation results than the default glibc malloc, which is based on dlmalloc. The explanation is that dlmalloc can be subject to external fragmentation when it has a mix of small and large objects on the heap. Jemalloc's use of array-based segregated heap for small objects and buddy-system like heap for large objects trades off external fragmentation for internal fragmentation.

I have a hunch that combining dlmalloc and jemalloc strategies might yield further improvement. Namely, for small sizes, use array-based segregated heap like jemalloc. For large sizes, use coalesceable heap like dlmalloc. This avoids the mixing of small and large objects on the same heap, so external fragmentation could be reduced with no impact on internal fragmentation.

No comments: