Thursday, November 11, 2010

Continuation passing style calling convention for a C-like language

There have been two strong indications where continuation passing style calling convention is necessary for a C-like language.
  • Although user-space thread library exists, it still could not truly implement lightweight processes. The problem is not about kernel-level context switching versus user-level context switching anymore, since both could be equally efficient. The problem is the stack requirement of any thread. Even user-level threads must reserve a large amount of space space (2MB), making it impossible to start thousands of threads unless you use 64-bit address space. The problem with 64-bit address space is that it doubles the storage requirement for pointers.
  • Parallel processing framework such as Cilk goes through great lengths trying to imitate continuation passing style, using Duff's device in generated C code so that execution could "resume" in the middle of a function. This complicates debugging as well as the scheduler of the runtime system.
Inspired by LLVM calling conventions and suggestions for continuation passing style and tail-call optimization, I propose a continuation passing style (CPS) calling convention for a C-like language, where:
  • A code block may be expressed like a function that returns void *.
  • All arguments are passed by register. Additional arguments must be spilled on a memory location, and the last register will be a pointer to the spilled location. The number of registers is machine-dependent, but at least two arguments are allowed.
  • Variadic argument must be spilled in memory as a variable length array.
  • One register is reserved for stack pointer, which must be restored to its original value when the control flow leaves the code block.
  • The stack is useful for calling a conventional C function, or for spilling block-local variables. Scope-local variables that span multiple blocks must be spilled to a heap-allocated location.
  • A code block transitions into another code block using a tail-call with a return statement.
  • It is possible to return from a continuation passing style code block. Stack pointer is restored, a void * value is prepared as the return value, and the function returns like a conventional C function (e.g. pops return address from stack, or branches to return address stored in a register, depending on the machine's native calling convention).
  • Only the stack register (and on some architecture, also the return address register) is callee saved. The caller must save all other registers.
Here is a matrix of caller-callee interaction. There are three types of caller: CPS tail-call, CPS non tail-call, and C. There are two types of callee: CPS and C.
  • Caller is CPS tail-call, callee is CPS: arguments are passed in registers, then branches to callee.
  • Caller is CPS tail-call, callee is C: behaves like a regular tail call from a C caller.
  • Caller is C, callee is CPS: all caller registers are saved, arguments are passed in registers, and a call is made as if the CPS callee is a C callee taking no arguments (i.e. return address may be stored on the stack or in a register, depending on the machine architecture).
  • Caller is CPS non tail-call, callee is CPS: a warning may be issued, but otherwise behaves as if the caller is C and callee is CPS.
  • Caller is CPS non tail-call, callee is C; or caller is C, and callee is C: behaves like regular C function call.
LLVM requires both caller and callee use the same calling convention. But unlike LLVM, we anticipate that C function will want to call CPS, and vice versa. It is intentional that CPS calling convention must be a subset of C calling convention.

The language will be used as an intermediate language for compiling higher level language such as ML, Cilk, etc. A typical function in the higher level source language will be broken down to more than one CPS code blocks.

For a source language that wants to use CPS for lightweight processes, it could be necessary for the lightweight processes to reuse the stack of the worker thread, and that context switching must require the stack to be pristine. The lightweight process may only yield to the scheduler when the stack is empty. This condition may be checked in the run-time, or enforced statically at compile time by having a "pristine" attribute (stack is pristine). A pristine CPS callee may only be called by a pristine CPS tail-call caller.

1 comment:

Likai Liu said...

Another reason to avoid threads with stack is that stack allocation and deallocation are expensive memory mapping operations. The deallocation is especially costly because unmapping memory requires TLB shootdown, which incurs synchronization overhead. This can be partly remedied by using thread pool, but the problem with thread pool is that tuning the thread pool size becomes voodoo magic. It highly depends on the code the thread runs as well as the hardware.