Sunday, November 14, 2010

Personal notes on selected ISMM 2010 papers

For Optimizations in a private nursery-based garbage collector,
  • Existing parallel generational garbage collector uses per-thread nursery, but after a minor collection, the per-thread nurseries may be allocated to another thread, causing massive false sharing. This is revealed by VTune.
  • This behavior is especially problematic because nursery cycles through quickly when the program allocates a lot of objects that quickly become garbage.
For Efficient memory shadowing for 64-bit architectures,
  • Shadow memory is used to store information about an application address, typically used by a memory analysis tool like Valgrind to answer questions like: is it allocated; is it owned by a particular thread; is it locked; etc.
  • Direct mapping stores information also in the application's address space. The information is stored at address * scale + displacement. Only works if the address space doesn't have holes.
  • Segmented mapping is a special case of direct mapping. The shadow memory is only big enough to store an indirect pointer where information can be retrieved. For 64-bit address, may use several levels of indirect tables, like how page table is organized. Indirection causes the segmented mapping to be several times slower than direct mapping.
  • The paper proposes an approach that allows several possible displacements (not too many). The displacement is chosen to guarantee no false hits. Addresses range that may cause false hits are reserved.
  • This technique may possibly be used by memory allocators as well. Increasing number of allocators have built-in memory analysis.
For Memory, an elusive abstraction,
  • On a multi-processor system, the ordering of memory read and writes, as a side-effect of processor design, may become an observable artifact by a program running on another processor. Cache coherence protocol can be designed to hide that artifact with slow-down, but Lamport observes that some applications can trade performance with lack of consistency.
  • Relaxed consistency needs precise documentation in order to reason about correctness of concurrent programs. However, processor vendors want to keep the specification vague.
  • My own observation: why not assume no consistency? In addition to memory barriers, the instruction set should provide a "write-through if cache-line is shared" instruction (if cache line is not shared, obviously we don't have to worry about consistency).
For The locality of concurrent write barriers,
  • Although I prefer applications that forego garbage collection for performance and predictability, this paper has a nice description of how concurrent garbage collection works, and a comparison of variations.

Thursday, November 11, 2010

Continuation passing style calling convention for a C-like language

There have been two strong indications where continuation passing style calling convention is necessary for a C-like language.
  • Although user-space thread library exists, it still could not truly implement lightweight processes. The problem is not about kernel-level context switching versus user-level context switching anymore, since both could be equally efficient. The problem is the stack requirement of any thread. Even user-level threads must reserve a large amount of space space (2MB), making it impossible to start thousands of threads unless you use 64-bit address space. The problem with 64-bit address space is that it doubles the storage requirement for pointers.
  • Parallel processing framework such as Cilk goes through great lengths trying to imitate continuation passing style, using Duff's device in generated C code so that execution could "resume" in the middle of a function. This complicates debugging as well as the scheduler of the runtime system.
Inspired by LLVM calling conventions and suggestions for continuation passing style and tail-call optimization, I propose a continuation passing style (CPS) calling convention for a C-like language, where:
  • A code block may be expressed like a function that returns void *.
  • All arguments are passed by register. Additional arguments must be spilled on a memory location, and the last register will be a pointer to the spilled location. The number of registers is machine-dependent, but at least two arguments are allowed.
  • Variadic argument must be spilled in memory as a variable length array.
  • One register is reserved for stack pointer, which must be restored to its original value when the control flow leaves the code block.
  • The stack is useful for calling a conventional C function, or for spilling block-local variables. Scope-local variables that span multiple blocks must be spilled to a heap-allocated location.
  • A code block transitions into another code block using a tail-call with a return statement.
  • It is possible to return from a continuation passing style code block. Stack pointer is restored, a void * value is prepared as the return value, and the function returns like a conventional C function (e.g. pops return address from stack, or branches to return address stored in a register, depending on the machine's native calling convention).
  • Only the stack register (and on some architecture, also the return address register) is callee saved. The caller must save all other registers.
Here is a matrix of caller-callee interaction. There are three types of caller: CPS tail-call, CPS non tail-call, and C. There are two types of callee: CPS and C.
  • Caller is CPS tail-call, callee is CPS: arguments are passed in registers, then branches to callee.
  • Caller is CPS tail-call, callee is C: behaves like a regular tail call from a C caller.
  • Caller is C, callee is CPS: all caller registers are saved, arguments are passed in registers, and a call is made as if the CPS callee is a C callee taking no arguments (i.e. return address may be stored on the stack or in a register, depending on the machine architecture).
  • Caller is CPS non tail-call, callee is CPS: a warning may be issued, but otherwise behaves as if the caller is C and callee is CPS.
  • Caller is CPS non tail-call, callee is C; or caller is C, and callee is C: behaves like regular C function call.
LLVM requires both caller and callee use the same calling convention. But unlike LLVM, we anticipate that C function will want to call CPS, and vice versa. It is intentional that CPS calling convention must be a subset of C calling convention.

The language will be used as an intermediate language for compiling higher level language such as ML, Cilk, etc. A typical function in the higher level source language will be broken down to more than one CPS code blocks.

For a source language that wants to use CPS for lightweight processes, it could be necessary for the lightweight processes to reuse the stack of the worker thread, and that context switching must require the stack to be pristine. The lightweight process may only yield to the scheduler when the stack is empty. This condition may be checked in the run-time, or enforced statically at compile time by having a "pristine" attribute (stack is pristine). A pristine CPS callee may only be called by a pristine CPS tail-call caller.