Sunday, November 14, 2010

Personal notes on selected ISMM 2010 papers

For Optimizations in a private nursery-based garbage collector,
  • Existing parallel generational garbage collectors use per-thread nurseries, but after a minor collection a thread's nursery may be handed to another thread, causing massive false sharing. This is revealed by VTune.
  • This behavior is especially problematic because the nursery cycles quickly when the program allocates many objects that soon become garbage.
For Efficient memory shadowing for 64-bit architectures,
  • Shadow memory stores information about an application address. A memory analysis tool such as Valgrind typically uses it to answer questions such as: is the address allocated; is it owned by a particular thread; is it locked; etc.
  • Direct mapping stores the information in the application's own address space, at address * scale + displacement. It only works if the address space doesn't have holes.
  • Segmented mapping is a special case of direct mapping. The shadow memory is only big enough to store an indirect pointer through which the information can be retrieved. For 64-bit addresses, it may use several levels of indirection tables, organized like a page table. The indirection makes segmented mapping several times slower than direct mapping.
  • The paper proposes an approach that allows several possible displacements (not too many). The displacement is chosen to guarantee no false hits. Address ranges that may cause false hits are reserved.
  • This technique could possibly be used by memory allocators as well. An increasing number of allocators have built-in memory analysis.
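The difference in cost between the direct formula and a table lookup can be sketched roughly as follows. This is only an illustration under my own assumptions, not the paper's implementation: the constants SCALE and DISPLACEMENT, the function direct_shadow_address, and the class TwoLevelShadow are all hypothetical names, and a Python dict stands in for the first-level table.

```python
# Hypothetical constants, chosen only for illustration.
SCALE = 1                          # shadow bytes per application byte
DISPLACEMENT = 0x2000_0000_0000    # assumed start of the shadow region


def direct_shadow_address(app_addr, scale=SCALE, displacement=DISPLACEMENT):
    """Direct mapping: one multiply and one add, no memory lookup."""
    return app_addr * scale + displacement


class TwoLevelShadow:
    """Table-based lookup, organized like a two-level page table."""

    def __init__(self, low_bits=16):
        self.low_bits = low_bits
        self.low_mask = (1 << low_bits) - 1
        self.top = {}  # dict stands in for the first-level table

    def set(self, app_addr, value):
        # Allocate the second-level chunk lazily, on first write.
        key = app_addr >> self.low_bits
        if key not in self.top:
            self.top[key] = bytearray(1 << self.low_bits)
        self.top[key][app_addr & self.low_mask] = value

    def get(self, app_addr):
        # Extra indirection: find the chunk first, then index into it.
        chunk = self.top.get(app_addr >> self.low_bits)
        if chunk is None:
            return 0  # untouched addresses carry no shadow state
        return chunk[app_addr & self.low_mask]
```

The direct version is pure arithmetic, while every access through the table pays for at least one additional memory lookup, which is where the slowdown of the indirect scheme comes from.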
For Memory, an elusive abstraction,
  • On a multi-processor system, the ordering of memory reads and writes, a side effect of processor design, may become observable to a program running on another processor. A cache coherence protocol can be designed to hide that artifact at the cost of a slow-down, but Lamport observes that some applications can trade consistency for performance.
  • Relaxed consistency needs precise documentation so that one can reason about the correctness of concurrent programs. However, processor vendors want to keep the specification vague.
  • My own observation: why not assume no consistency? In addition to memory barriers, the instruction set should provide a "write-through if cache-line is shared" instruction (if the cache line is not shared, obviously we don't have to worry about consistency).
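To make "observable artifact" concrete, here is a toy model of the classic store-buffering litmus test: thread 0 writes x then reads y, thread 1 writes y then reads x. It is a sketch under my own assumptions, not something taken from the talk: per-thread store buffers are just one common source of reordering, and the function name outcomes and the event names are made up for this example. The code enumerates every interleaving and collects the values the two reads can return.

```python
from itertools import permutations


def outcomes(buffered):
    """Return every (r0, r1) pair the two loads can observe."""
    programs = {0: ("x", "y"), 1: ("y", "x")}  # thread -> (stored var, loaded var)
    steps = ("store", "load", "flush") if buffered else ("store", "load")
    events = [(tid, step) for tid in programs for step in steps]
    results = set()
    for order in permutations(events):
        pos = {event: i for i, event in enumerate(order)}
        # A thread issues its store before its load and before the flush,
        # but the flush may happen either before or after the load.
        if any(pos[(tid, "store")] > pos[(tid, later)]
               for tid in programs for later in steps[1:]):
            continue
        memory = {"x": 0, "y": 0}
        buffers = {0: {}, 1: {}}
        regs = {}
        for tid, step in order:
            stored, loaded = programs[tid]
            if step == "store":
                if buffered:
                    buffers[tid][stored] = 1   # not yet visible to the other thread
                else:
                    memory[stored] = 1         # immediately visible
            elif step == "flush":
                memory.update(buffers[tid])
                buffers[tid].clear()
            else:  # load: check the thread's own buffer first, then memory
                regs[tid] = buffers[tid].get(loaded, memory[loaded])
        results.add((regs[0], regs[1]))
    return results
```

Without buffers, at least one thread always sees the other's write, so (0, 0) never appears in outcomes(False). With buffers, (0, 0) becomes reachable in outcomes(True), and that extra outcome is exactly the kind of artifact a program on another processor can observe.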
For The locality of concurrent write barriers,
  • Although I prefer applications that forgo garbage collection for performance and predictability, this paper has a nice description of how concurrent garbage collection works, and a comparison of variations.
