Monday, May 24, 2010

Mac OS X thread local storage

Historically, Mac OS X uses the Mach-O executable format, which does not support thread local storage data section found on most operating systems using ELF executable format (here is how ELF handles thread local storage as documented by Ulrich Drepper). If you try to compile a program using __thread modifier, you'd get the following error.
$ cat tls.c 
static int __thread x;
int main() { return 0; }
$ gcc tls.c 
tls.c:1: error: thread-local storage not supported for this target
However, it has been suggested that pthread_getspecific() has a fast implementation on Mac OS X. The pthreads thread specific functions are POSIX API facility for thread local storage (TLS).

On Mac OS X, pthread_getspecific() for i386 takes exactly three instructions. It is similar to ELF handling of TLS on i386, in the sense that both use the GS segment to point to a thread control block. However, the difference is that ELF only stores the thread pointer in the thread control block, which is then used to calculate the address of the thread local storage data section for that thread. On Mac OS X, pthread specific values are allocated from the thread control block in the GS segment. This saves another indirect reference, so that could be faster.

The way Mac OS X implements pthread_getspecific() for ppc target is more complicated. First, it looks up from the COMMPAGE (defined in cpu_capabilities.h) the address to several possible pthread_getspecific() functions implemented by the mach kernel, and branch there. From there, thread control block is either obtained from SPRG3 (special purpose register #3), or by an ultra-fast trap. The SPRG3 points to thread control block where pthread specifics are allocated from, so the approach is comparable to using GS segment on i386. The ultra-fast trap approach, despite its name, is a slower fall-back approach that appeals to the task scheduler to find out which thread is currently running.

There is no reason why pthread_getspecific() couldn't be implemented as efficiently as compiler's thread local storage variables. However, with __thread, the compiler could cache the address for the thread local variable instance and reuse it in the current function. If several functions are inlined, all of them could use the same pre-computed address. With pthread_getspecific(), the compiler doesn't know that the address to the pthread specific variable can be reused. It has to call the function again whenever the value is asked. For this reason, Mac OS X also provides inlined version of pthread_getspecific() to get rid of the function call overhead, as well as allowing the compiler to pre-compute the offset into the thread control block whenever possible.

I don't know how other operating systems implement pthread_getspecific(), but if they already support thread local storage variables, they should be able to implement pthread specific like this to achieve the efficiency as expected on Mac OS X:
// Assume we only need a finite number of slots.
#define MAX_SLOTS 512

// For storing thread specific values, using thread local storage.
__thread static void *specifics[MAX_SLOTS];

// For storing allocation status for a pthread-specific key.
static struct {
  int created;
  void (*destructor)(void *);
} key_descriptors[MAX_SLOTS];

// For locking the routine that has to manipulate key_descriptors,
// e.g. pthread_key_create() and pthread_key_delete().
pthread_mutex_t key_descriptor_lock;
This is inspired by pthread_tsd.c in Mac OS X Libc. Notice that, unlike their Libc, we can implement pthread_getspecific() and pthread_setspecific() using trivial inline functions or macros that access the specifics array.

Therefore, my recommendation is to use POSIX pthread specific in your program, which will work on any POSIX including Mac OS X. If your operating system has a slow POSIX pthread specific implementation, but the compiler has thread local storage support, then you can roll your fast pthread specific like shown above.

No comments: