Thursday, November 29, 2012

MISC: binary Galois Field multiplication

After reading NIST may not have you in mind, I decided that a new ISA should support binary Galois Field multiplication. Binary Galois Field is a special case of binary polynomial \( \mathbb{Z}_2[x] \) where addition and subtraction are already supported as XOR, written as \( \oplus \). What is missing for GF is multiplication and division in \( \mathbb{Z}_2[x] \). The binary polynomials are expressed as bit vectors where the \(n\)-th bit is the coefficient of the \(x^n\) term. Let \( \otimes \) be polynomial multiplication.
  • pmul: \( (a, b) \to (r, s) \) where \( r s = a \otimes b \), i.e. \( r \) is the upper word of the result, and \( s \) is the lower word.
  • pdiv.mod: \( (a, b) \to (q, r) \) where \( q \otimes b = a \oplus r \), i.e. \( q \) is the quotient and \( r \) is the remainder.
It's been a while since I last studied abstract algebra, so I might not have gotten all my terminology correct, but you get the idea.

Friday, November 23, 2012

MISC: a high-level comparison to TTA

There is this old saying that there is nothing new under the sun. While I thought I invented something new, it came to my attention the Transport Triggered Architecture (TTA) which I did not know before. I found some papers as early as 1994, by Henk Corporaal.

Both MISC and TTA specify an instruction set architecture where each ALU has associated input and output ports, and writing to the input ports trigger a computation.

MISC falls short describing how the instruction set would actually be implemented in hardware. It assumes that there is a hardware scheduler that can detect when the output port is ready and advances the pipeline. The compiler only needs to worry about transforming the program into a set of pipelines. This abstracts out the timing details such as gate length out of the instruction set architecture.

TTA programs consist of cycle-stepped move instructions that has to take gate length into account, due to the lack of a global control logic. If an operation requires two inputs—one is called the operand and the other is called a tigger—the operand must be stored before or during the same cycle as the trigger. If the program moves a value out of the result too early, it would read out a corrupt value. This means that the compiler has to take gate length into account while scheduling instructions. If a function unit takes too long (e.g. cache misses) to complete an operation, it must lock the data transport bus so additional move would not take place.

In contrast, MISC exposes no cycle-sensitive timing in the pipeline. The monadic design of the functional units allows pipelines to synchronize by data dependency graph, not by timing. Hence there is no need to globally lock the pipeline. I mentioned one instruction to wait for all moves in the instruction queue to finish, but I only added there for the purpose of context switching. I still need to elaborate more on how I plan to support preemptive multitasking.

TTA already has a hardware implementation. It is not clear if the idealized scheduler for MISC is implementable in hardware, or if it turns out to be the performance bottleneck.

Saturday, November 10, 2012

MISC: Computation Modules

This is a continuation of Microcosmic Instruction Set Computer, defining the computation modules before I decide on the port assignments. The notation is \( (I_1, I_2, ..., I_m) \to (O_1, O_2, ..., O_n) \) for input ports \( I_1 ... I_m \) and output ports \( O_1 ... O_n \). The total number of input and output ports for each operator are kept as power of 2 to allow less fragmentation in port assignment later.

Arithmetic operators

  • add: \( (a, b) \to (r, c) \) where \( r = a + b \) and \( c \) is the carry bit.
  • sub: \( (a, b) \to (r, c) \) where \( r = a - b \) and \( c \) is the carry bit.
  • neg: \( a \to b \) where \( b = -a \).
  • mul: \( (a, b) \to (r, s) \) where \( r s = a \times b \), i.e. \( r \) is the upper word of the result, and \( s \) is the lower word.
  • div.mod: \( (a, b) \to (q, r) \) where \( q \times b = (a - r) \), i.e. \( q \) is the quotient and \( r \) is the remainder.
  • smul, sdiv.mod: same as mul and div.mod but for signed integers.
  • and: \( (a, b) \to (c, c') \) where \( c = a \wedge b \) and \( c' = a \wedge \neg b \), both bitwise.
  • or.xor: \( (a, b) \to (c, c') \) where \( c = a \vee b \) and \( c' = a \oplus b \), both bitwise.
  • rot.left: \( (a, c) \to (r, s) \) where \( r \) is the result of shifting \( a \) to the left by \( c \) bits, and \( s \) is the overflow part of \( a \) adjacent to the least significant bit.
  • rot.right: \( (a, c) \to (r, s) \) where \( r \) is the result of shifting \( a \) to the right by \( c \) bits, and \( s \) is the underflow part of \( a \) adjacent to the most significant bit.
  • not: \( a \to b \) where \( b = \neg a \), bitwise.
Signed integers use two's complement representation, so they don't need to distinguish addition and subtraction. Note that the conditional branch instruction should be able to use the condition from the least significant bit which is the carry bit, as well as the most significant bit which is the signed bit. This allows comparing both signed and unsigned integers.

Convenience operators

  • lea: \( (b, a, x) \to r \) computes \( r = 2^a \cdot x + b \).
  • add3: \( (a, b, c) \to r \) computes \( r = a + b + c \).
  • or3: \( (a, b, c) \to r \) computes \( r = a \vee b \vee c \).
  • xor3: \( (a, b, c) \to r \) computes \( r = a \oplus b \oplus c \).
Any of these operators—especially “or” or “xor”—could be used for the purpose of joining the completion of all computations. These are not to be confused with non-deterministic join. A non-deterministic join is not necessary; just have two pipes connecting from some two different output ports to the same input port.

Conditional operators

  • cond: \( (c, a, b) \to r \) where \( r = c \mathop{?}  a : b \), i.e. returns \( a \) if \( c \) is nonzero, otherwise \( b \).
  • cx: \( x \to c \) where \( c \) extends the carry bit of \( x \) to fill the whole word.
  • sx: \( x \to s \) where \( s \) extends the signed bit of \( x \) to fill the whole word.

Memory controller

  • read: \( a \to r \) where \( r \) is the value of memory word at address \( a \) which must be word-aligned.
  • write: \( (a, x, z) \to z' \) to write \( x \) to the memory word at address \( a \) which must be word-aligned, whenever \( z \) is also ready. The value of \( z \) is not used. When the write finishes, \( z' \) becomes ready with a zero value.
  • cswap: \( (a, new, old) \to old' \) writes \( new \) to the memory word at address \( a \) only when \( a \) still contains the expected \( old \) value. The actual value of \( a \) before the swap is returned as \( old' \).

Byte packing

  • pack: \( (a, b, p) \to r \) where \( r \) is the result of permuting the concatenation of \( a b \) according to mapping \( p \).
  • unpack: \( (a, p) \to (r, s) \) where \( r \) and \( s \) are the two words selected from \( a \) by mapping \( p \) and they can be optionally sign extended.

Coprocessor I/O

  • exec: \( (k, a_1, a_2) \to r \) sends a two-argument command to co-processor \( k \) and returns its result in \( r \).
  • exec2: \( (k, a_1, a_2, a_3, a_4, a_5) \to (r_1, r_2) \) sends a five-argument command to co-processor \( k \) and returns a pair of results in \( r_1 \) and \( r_2 \).
This can be used to drive the floating point coprocessor as well. The floating point coprocessor, for example, would occupy a range of k in the coprocessor namespace, one for each operation.

Some concluding remark

The fact that all computation modules above have non-empty inputs and outputs means that this forms a monadic set of operators, where synchronization is accomplished by port pipelining. I guess you can call this Monadic Instruction Set Computer also.

Because of the fine granularity of some of the computation modules, it makes sense to base the cycle on the finest granule which is the gate distance, e.g. the bitwise operators are all single gate distance, hence single cycle. This means that while some modules are single cycle, others can be hundreds or thousands of cycles.

Friday, November 9, 2012

Microcosmic Instruction Set Computer

When it comes to instruction set architecture, there are many philosophies, ranging from CISC to RISC down to extremes like OISC or ZISC. The dominent is still Intel x86 or x86-64 which is CISC, but ARM is getting popular too which is RISC. I've not seen any commercial product based on OISC or ZISC so they are probably not practical.

Having had some experience in Internet planetary scale distributed computing where remote procedure calls are made between computer services, coming back to looking at instruction set design gives me the revelation that even a single-core microprocessor is itself a distributed system. Rather than a distributed system of disk storage, memcache, and web servers, the microprocessor is a distributed system of arithmetic logic units, memory controller, and I/O buses. This gives rise to the idea of a microcosmic instruction set computer where the instruction set takes care of basic register file and control flow, but offloads all computing and memory I/O activities to the microcosm of distributed services on the processor die.

The instruction set features:

  • Some (yet-to-be specified) instructions for unconditional and conditional branching, namely to affect the instruction fetching.
  • A small I/O address space (~16K?) each specifies a word-sized port, which are buffered memory that can be written to and read from. Each port also has a “ready” bit to signal the availability of data, for the purpose of instruction scheduling.
  • Lower tiered I/O ports (0-15) are simply buffered general purpose register file. Middle tier ports (16-255) are for multiple ALUs, memory controllers, and external I/O controllers. Upper tier ports (256-?) are laid out in groups of 256 like the lower and middle tier ports (0-255) to enable additional instruction level parallelism.
  • An instruction specifies a move from one source port to a destination port. The instruction is only executed when data for that port is ready, but multiple instructions can be queued by the instruction scheduler. The move instruction can be seen as “connecting” a pair of ports.
  • An instruction to write a small constant value directly to a port.
  • One instruction to wait for all moves in the instruction queue to finish.
All ports can be used like a register (i.e. they are buffered), but some ports are used for inputs and some are used for outputs. For example, an adder for \( i + j = k \) would occupy three ports, \( p_i \), \( p_j \), and \( p_k \). The adder begins working whenever the ready bits of \( p_i \) and \( p_j \) are set, and the ready bit of \( p_k \) is only asserted when the result is available. The adder can be powered off when it's not doing work.

The instruction scheduler could dynamically map port numbers in groups of 256 if it wishes to turn off additional die area to reduce power use even further.

Even within a single group, ports of the same functionality may indeed be a queue to a smaller number of units. For example, the port assignment might give \( 8 \times 3 = 24 \) ports to an adder, but a particular chip might only have 2 physical adders doing the work of 8 logical ones. It is particularly useful to have multiple logical units of memory controller to allow memory I/O to be queued.

To be continued...

Friday, September 14, 2012

Design of a Logging System

Here is design of a logging system that provides a convenient way to trace the execution of a long running program with arbitrary format strings and data, serialized to a compact binary file.

It needs to store the following information per emitted data record:
  • executable image magic which identifies the logdef table used.
  • __FILE__ (via logdef table)
  • __LINE__
  • backtrace() (an array of machine words, looked up via logdef table)
  • pthread_self()
  • log level
  • current time from gettimeofday()
  • format string (printf formatted, via logdef table)
  • format string arguments (all types accepted by printf, no logdef lookup)
The log will contain the following record types:
  • Preamble: stores the full path to the executable image and its content fingerprint (size + {md5 or sha1}) so that corrupted logdefs can be reconstructed. Also stores the magic derived from the fingerprint.
  • String logdef: a "log definition" for magic × uintptr_t -> string, for __FILE__ and format string lookup.
  • Backtrace logdef: a "log definition" for magic × uintptr_t -> {dli_fname: string, dli_fbase: uintptr_t, dli_sname: string, dli_saddr: uintptr_t} as returned by dladdr(), for backtrace lookup.
  • Data record, as described above.
Log files may be arbitrarily concatenated. In that case, the logdef table id which is the same as the executable image fingerprint magic will allow each data record to be interpreted using a different set of logdefs. The preambles may be corrupted also. If both the preamble and logdefs are corrupted, a data record may have missing interpretations.

Logdefs are emitted lazily. The logging system in process maintains a smaller lookup table that tracks whether a logdef has already been emitted. The logdef should immediately precede the data record that references it.

The log serialization format is based on msgpack. A log file consists of frames. Each frame is the sequence:
  • 0xc1 | crc32 | 0xc1 | raw_record
Note that 0xc1 is reserved in msgpack, the value between nil (0xc0) and boolean false (0xc2).

The crc32 is the CRC32 for the bytes inside raw_record, as msgpack object uint32_t (0xce).  A raw_record is a msgpack raw object, either fixraw (0xa0—0xbf), raw16 (0xda) or raw32 (0xdb) that frames a record. This structure, along with the reserved 0xc1, can be used to re-sync to frames in case the log file is corrupted.

Each record is simply two msgpack objects concatenated.
  • tag | payload
Tag is a simple positive fixnum indicating the record type, preamble (0), string logdef (1), backtrace logdef (2), data (127). The payload depends on the record type. Payloads are maps (fixmap, map16, map32) with field_tag to field_value mapping. Fields are specific to record type, but the payload being a map allows a field to be optional, in a similar vein to protocol buffer messages.

Thursday, September 6, 2012

Apple Mac OS X keyboard viewer got stuck maximized

My mom likes pushing buttons, but today she pushed the wrong one. She often uses keyboard viewer to assist her with input methods because she couldn't remember the key mappings. Once the keyboard viewer windows is shown, you can click on the green button to maximize the keyboard viewer, and clicking on the green button again should restore its original size. However, if you drag the keyboard viewer in its maximized state just slightly off the screen, then the green button no longer restores the keyboard viewer back to its original size. The maximized keyboard viewer looks like this and occupies half of the screen.
This happens on Mac OS X Lion (10.7). Not sure if this happens with other versions also.

The solution is to first close the keyboard viewer, then use Terminal (under Applications/Utilities) and enter the following command at the prompt:
defaults delete
The keyboard viewer should be restored to its original state next time you open it.

Another issue my mom was having was that when she tried to enter Chinese, certain characters cause programs to crash (also known as 地雷字, di lei zi, meaning “landmine characters”). It turns out the culprit is the "Use symbol and text substitution" option. It can be turned off under Preferences, Language & Text, Text. Text substitution erroneously matches some input sequences prematurely from the input method and causes invalid characters to be entered into the program.

Monday, September 3, 2012

Mistake in my RC4 implementation

Jim Wilcoxson approached me two days ago regarding the skewed distribution I observed in RC4. He sent me a snippet from SQLite's implementation of RC4 and showed that it did not suffer the non-uniform result I saw.

After digging out my old code, I found two mistakes I made. One in the key scheduling algorithm:
  void initialize(const char *begin, const char *end) throw() {
    for (size_t i = 0; i < 256; ++i)
      s_[i] = i;

    const char *key = begin;
    uint8_t j = 0;

    for (size_t i = 0; i < 256; ++i, ++key) {
      if (key == end) key = begin;
      j = j + s_[i] + *key;
      std::swap(s_[i], s_[j]);
In the original code, I had forgotten to ++key, so the key scheduling is always done on the first byte of the key only. The addition for the fix is highlighted in yellow.

Another mistake is an index out of bounds error in the generation of bytes.
  uint8_t next() throw() {
    i_ = i_ + 1;
    j_ = j_ + s_[i_];
    std::swap(s_[i_], s_[j_]);
    return s_[s_[i_] + s_[j_]];
There is an array index out of bounds error in the return line. The reason is that in C/C++ compiler, the intermediate result of an expression, if it's an int, does not inherit the int size of the sub-expressions, but instead is word sized. Since both s_[i_] and s_[j_] are (0...255), half of the times the result is in (256...511) which accesses some memory beyond s_[]. This explains why I get a skew where most byte value tallies are halved.

The fix is to replace the return line as follows:
    return s_[(uint8_t) (s_[i_] + s_[j_])];
Where an explicit cast to uint8_t solves the index out of bounds problem.

The suggestion I gave, to add ^ j, only obscures the problem.

Fortunately my code has never been used in a production system. It was written quickly to give me a PRNG for a memory usage simulator I wrote last year for testing a memory allocator where the PRNG (1) must not use memory allocation itself, (2) must not use locking, (3) is reasonably fast and easy to implement. The implementation satisfied all my original requirements but was not RC4.

Sunday, August 5, 2012

A critique of non-blocking deque due to Arora, Blumofe, and Plaxton

Arora, Blumofe and Plaxton presented a non-blocking implementation of work-stealing deque in the paper Thread scheduling for multiprogram multiprocessors (SPAA 1998). Before that, Blumofe worked on the randomized work stealing algorithm for Cilk, with Charles Leiserson.

Here I wish to make a few notes on the non-blocking deque implementation in the SPAA 1998 paper. A deque is described by the following preamble.
// An abstract, opaque type describing a unit of work.
struct Thread;

// A finite but arbitrary large capacity.
#define DEQ_SIZE ...

// An array for storing items in the deque.
Thread *deq[DEQ_SIZE];

// A top pointer that is time-stamped to avoid ABA problem.
struct Age {
  size_t top;
  size_t tag;

volatile Age age;

// Assume there is a compare and swap for struct Age.  Most
// likely this is going to use a double word compare and swap.
// If a machine does not support that, adjust the struct Age
// above to divide a machine word into two bit fields, and use
// single-word compare and swap.
// Modifies newAge to the value of oldAge if the operation
// succeeds.
void cas(volatile Age& mem, Age oldAge, Age& newAge);

// Bottom pointer is manipulated solely by the worker who
// owns the deque, never by thief.
volatile size_t bot;
All of the code below is copied verbatim from the paper; it is missing necessary type annotation, which must be added before the code could compile, but adding type annotation is straightforward.

The owner puts work into the deque using pushBottom() as defined below.
void pushBottom(Thread *thr)
  localBot = bot;
  deq[localBot] = thr;
  bot = localBot;
Here we note that the pushBottom() function doesn't actually respect the capacity of the deque. The deque might overflow without notice. Bad.

A thief steals work using popTop() as defined below. The difference between "bottom" and "top" only serves to illustrate that thief is stealing from opposite end of the deque than the owner. This allows the owner to go on its own merry ways most of the times, until contention with the thief happens when the queue is nearly empty.
Thread *popTop()
  oldAge = age;
  localBot = bot;
  if (localBot <=
    return NULL;
  thr = deq[];
  newAge = oldAge;;
  cas(age, oldAge, newAge);
  if (oldAge == newAge)
    return thr;
  return ABORT;
The assumption is that if the thief manages to increment age, then it is assumed to claim the item at deq[]. If the increment fails due to contention with the owner (queue becomes empty) or with a fellow thief (queue is not empty yet), the current thief gives up. In the latter case, an alternative implementation could decide to retry from the beginning rather than abort. This causes a thief to potentially wait for an unbounded time, but system-wise at least one thief is making progress. The owner removes an item from the deque using popBottom() as follows.
Thread *popBottom()
  localBot = bot;
  if (localBot == 0)  // #a
    return NULL;
  bot = localBot;  // #c
  thr = deq[localBot];
  oldAge = age;
  if (localBot >  // #b
    return thr;
  bot = 0; = 0;
  newAge.tag = oldAge.tag + 1;
  if (localBot == {
    cas(age, oldAge, newAge);
    if (oldAge == newAge)
      return thr;
  age = newAge;
  return NULL;
The deque is empty when either bot is 0 [#a] or if the bottom meets the top [#b]. There is a race condition between the owner and thief. Suppose at the beginning, bot == + 2 (this is necessary for the conditon at [#b] to hold). The owner got momentarily preempted by the operating system right before [#c], and meanwhile two thieves come along and both succeed. The owner wakes up and happily returns the "thr" value that the last thief already took away. This causes a task to be scheduled twice.

Another peculiar fact about the deque is that, suppose we impose an upper bound for bot so that pushBottom() will not go out of bound. The deque might contain far fewer items than its capacity allows, if both bot and top are very close to the upper bound but the deque is not empty. This means that the capacity of deque is only reclaimed when the deque is completely empty. Again, not a very good design in practice.

Wednesday, July 18, 2012

Contiki protothreads

I just want to make a note for myself about Protothreads in Contiki operating system, which is an operating system for devices that are extremely memory constrained. Here are some examples of protothreads.

The protothreads implementation can be found in core/sys/pt.h. A protothread is a function that returns a char indicating its state. The states are PT_WAITING, PT_YIELDED, PT_EXITED, and PT_ENDED. Each operation that causes the thread to block actually makes the function return, so the values of local variables are not preserved.

The resume execution of a protothread is based on “local continuation” which has two implementations. One uses the switch statement to create a Duff's device (see core/sys/lc-switch.h), and another one based on a GCC __label__ extension (see core/sys/lc-addrlabels.h).

Despite the simplicity of the implementation, here are limitations of protothreads:

  • Local variables are in undefined state after context switching, so they must be used extremely carefully. Most examples will let the thread take an external “context” struct argument and store state there.
  • Context switching can only happen inside the main protothread routine but not in sub-function calls. This limitation is very similar to Cilk (not surprising because Cilk is a compiler that generates code very similar to that of protothread).
I think that continuation passing style is the answer. And it looks like someone else agrees.

Saturday, June 30, 2012

Minimal unsatisfiable 3-SAT example

The 3-SAT problem defined here is a stricter version than elsewhere, in the sense that each clause must contain exactly three literals. Some versions permit 1 or 2 literals.

To obtain a minimal unsatisfiable 3-SAT example, we start with a two-clause contradiction representing \( u_1 \wedge \neg u_1 \) as the base case. This base case is not 3-SAT. We then grow the clause as follows: suppose \( C_{i-1} \) containing \( i-1 \) variables is unsatisfiable, then let \( C_i = \left\{ c \vee u_i, c \vee \neg u_i \mid c \in C_{i-1} \right\} \), and \( C_i \) is also unsatisfiable.

To carry out the algorithm, we start with the base case.
  • \( u_1 \)
  • \( \neg u_1 \)
Then we add another variable to each clause:
  • For \( u_1 \):
    • \( u_1 \vee u_2 \)
    • \( u_1 \vee \neg u_2 \)
  • For \( \neg u_1 \):
    • \( \neg u_1 \vee u_2 \)
    • \( \neg u_1 \vee \neg u_2 \)
Then we add another variable to each clause:
  • For \( u_1 \vee u_2 \)
    • \( u_1 \vee u_2 \vee u_3 \)
    • \( u_1 \vee u_2 \vee \neg u_3 \)
  • For \( u_1 \vee \neg u_2 \)
    • \( u_1 \vee \neg u_2 \vee u_3 \)
    • \( u_1 \vee \neg u_2 \vee \neg u_3 \)
  • For  \( \neg u_1 \vee u_2 \)
    • \( \neg u_1 \vee u_2 \vee u_3 \)
    • \( \neg u_1 \vee u_2 \vee \neg u_3 \)
  • For \( \neg u_1 \vee \neg u_2 \)
    • \( \neg u_1 \vee \neg u_2 \vee u_3 \)
    • \( \neg u_1 \vee \neg u_2 \vee \neg u_3 \)
And here we have an unsatisfiable 3SAT example with 3 variables and 8 clauses.

Thursday, June 28, 2012

Oversized tablet

Presumably, one could put together a Planar 32" multi-touch monitor (manual, datasheet) mounted on one of the VESA 400x200 monitor stands and hook the monitor up to one of the NVIDIA development boards with TEGRA processor, such as the Toradex Colibri T30 (announced but not available yet), install Android, and get a giant 32 inch tablet. Such system, along with some custom software, would make it a nice interactive kiosk either at a museum or at home for about $3000 material cost.

Saturday, June 16, 2012

Building autotools from git

For developing GNU software, it is often needed to create the autoconf, automake, and libtool toolchain at the exact version specified, but it's not too hard to build them from source. In my case, I simply did:
  • git clone git://
  • git clone git://
  • git clone git://
And this will let me do a git checkout vN.NN to obtain the specific version of these tools. However, building them from upstream pristine source is different than building from tarball distribution. You will already need a recent version of these autotools to bootstrap the source's build system. Run:
git clean -f -X
will restore the working tree to the pristine condition.

To build autoconf from source:
  • aclocal -I m4 && automake --add-missing && autoconf
  • rm INSTALL
The aclocal -I m4 addresses the following issues:
  • automake complains: MAKE_CASE_SENSITIVE does not appear in AM_CONDITIONAL
  • autoconf complains: error: possibly undefined macro: AC_PROG_GNU_M4
          If this token and others are legitimate, please use m4_pattern_allow.
          See the Autoconf documentation. error: possibly undefined macro: AC_PROG_MAKE_CASE_SENSITIVE
The rm INSTALL addresses the following issue during make, caused by build-aux/missing wanting to overwriting a symlink created by automake --add-missing before, pointing to a system INSTALL file (this would be pretty nasty if you ran make as the same user that installed the automake you used for bootstrapping autoconf, since you would never notice the overwriting).
Making all in .
/bin/sh something/autoconf/build-aux/missing --run makeinfo --no-headers --no-validate --no-split  --plaintext -o something/autoconf/INSTALL \
something/autoconf/INSTALL: Permission denied
Here are the specific versions of these autotools I needed for binutils:
  • autoconf v2.64, automake v1.11.1
In general, for the autoconf version, look for AC_PREREQ in or; for the automake version, look for AM_AUTOMAKE_VERSION in aclocal.m4.

Sunday, June 10, 2012

Some notes for porting Fink packages to 10.7

For all .info file:
  • Add 10.7 to Distribution as a comma separated list.
    Distribution: 10.4, 10.5, 10.6, 10.7
  • Add x86_64 to Architecture
    Architecture: powerpc, i386, x86_64
If this is not done correctly, Fink would say "no package found for specification..." since the .info files are filtered by distribution and architecture. On 10.7, the Intel architecture is x86_64.

Specific to gcc43:
  • Change all gmp dependencies to gmp5 (and get rid of >= version constraints).
  • Change all libmpfr1 dependencies to libmpfr4 (and get rid of >= version constraints).
  • Change ecj-latest.jar MD5 to d7cd6a27c8801e66cbaa964a039ecfdb
  • Remove ln -s ... %i/bin/$binfile-4 and remove Conflicts and Replaces. There is no other reason why gcc 4.x cannot coexist.
  • Replace all traces of i686 with %m. Optionally change --with-tune=generic to --with-tune=core2.
  • To workaround build failure “error: redefinition of a 'extern inline' function 'xxx' is not supported in C99 mode.” Per clang instruction, set CC='gcc -std=gnu89' to workaround it. Another issue surfaced in the second stage is undefined symbol for architecture x86_64 '___builtin___stpncpy_chk' which is resolved by adding -D_FORTIFY_SOURCE=0 to CFLAGS.
    SetCC: gcc -std=gnu89
  • To resolve ld: duplicate symbol _init_inline_once issue, remove redundant tree-inline.o from CXX_C_OBJS in ${SRCDIR}/gcc/cp/
  • Add --enable-libgcj to the ConfigureParams. For some reason this flag is missing, and libgcj is not, but is expected, to be built.
  • Remove Type: -64bit and redundant references of the %lib expansion, particular those in the SplitOff.
The end.

Tuesday, June 5, 2012

Better upper bound for factorial?

Tonight I was puzzled by the question, if an algorithm \( \Pi \) has running time \( O(n!) \), does \( \Pi \) belong to EXPTIME? To rehash, EXPTIME contains all algorithms that has running time \( O(2^{n^k}) \). For brevity, I use the notation \( f(x) \prec g(x) \) to denote that \( f(x) \) grows slower than \( g(x) \). We know that \( 2^n \prec n! \prec n^n \) but I don't think I can show that \( n^n \) belongs to EXPTIME. I need a tighter bound.

It turns out that \( n! \prec 2^{n^2} \).

  • The growth of \( n! \) is \( \frac{(n+1)!}{n!} = n+1 \).
  • The growth of \( 2^{n^2} = (2^n)^n \) is \[
      & = & \frac{(2 \cdot 2^n)^{n+1}}{(2^n)^n} \\
      & = & \frac{(2 \cdot 2^n)^n \cdot (2 \cdot 2^n)}{(2^n)^n} \\
      & = & \frac{2^n \cdot \cancel{(2^n)^n} \cdot (2 \cdot 2^n)}{\cancel{(2^n)^n}} \\
      & = & 2 \cdot 2^{2n} = 2^{2n+1} \\
Since \( n+1 < 2^{2n+1} \) for all \( n \), therefore \( n! \prec 2^{n^2} \).
\( \Box \)
So it turns out that an \( O(n!) \) algorithm belongs to the class \( O(2^{n^k}) \) where \( k = 2 \), so such algorithm indeed belongs to EXPTIME.

Update (June 7, 2012): it turns out that showing \( n^n \) belongs to EXPTIME isn't that hard.

The growth of \( n^n \) is \[
  & = & \frac{(n+1)^n (n+1)}{n^n} \\
  & = & \left( \frac{n+1}{n} \right)^n (n+1) \\
  & = & \left( 1 + \frac{1}{n} \right)^n (n + 1) \\
  & \approx & (n+1)e \\

And that's because \( \lim_{n \to \infty} \left( 1 + \frac{1}{n} \right)^n = e \).  So \( n^n \) even grows slower than \( 2^{n^2} \).

Saturday, May 19, 2012

Locality profile of a disk drive

When I was an undergraduate, I had at least one professor who told his students that we're told lies in order to be told the truth. The lies are simplification of the truth so that we could understand the point the professor wants to get across.

In the case of disk performance, I've never been told the truth, and I never knew it until I found this paper, “An introduction to disk drive modeling.”

We were taught that seek time is the dominant factor of access latency, and that “average seek time” is proportional to 1/3 of the number of tracks. The truth is much more complicated.
  • A seek consists of four phases, speedup (accelerating the head), coast (head travels at constant velocity), slowdown (deceleration of the head), and settle (the head aligns itself to the track). The way shouting causes access latency to spike is probably due to interfering in the settling phase.
  • There is more surface area towards the outer tracks than the inner tracks, and assuming that the disk is to have constant linear density in both x-y dimensions, the angular and radius differential will be non-linear. As thus, we could have either:
    • Constantly spaced tracks where inner tracks have fewer sectors and outer tracks have more sectors, or
    • Tracks with constant number of sectors, where inner tracks are further apart and outer tracks are closer, or
    • A hybrid model that is somewhere in between, depending on which geometry gives the better seek time.
Also, most disk drives nowadays expose a linear addressed sector scheme as opposed to CHS (cylinder, head, sector) which is specific to disk geometry. The linear addressing hides away the complex geometry that is disk specific, and all we care about nowadays is the seek time profile from one linear addressed sector to the other sectors.

Quite interestingly, where as in the case of random accessed memory, temporal locality prevails spatial locality, it is the other way with disk drives. For random accessed memory, spatial locality at worst will cause n times performance degradation where n is the size of the cache line divided by the word size, a lack of temporal locality can often cause a total breakdown of the memory hierarchy. For disk access, the lack of temporal locality means that a group of n read/write operations that could previously be grouped together in one single seek sweep would have to be performed as different seeks back and forth, it is the lack of spatial locality of the on-disk data arrangement that causes the seek time to soar.

As part of a disk-based filesystem design, it should allow files accessed together to belong to a (spatial) locality group.

There are already databases that store columns into locality groups of columns, but that sort of locality group only affects the “virtual” linear address within a file, which by virtue of the filesystem may be fragmented throughout the disk.

Saturday, April 21, 2012

Zero-Latency Computer Audio Effect Notes

When it comes to audio processing, it is believed that the human ear can detect at least 13ms of delay. In a computer audio setting, a large delay can be attributed to the underlying operating system. For example, it could be that the OS or driver does not handle interrupts (making input data available) fast enough, not scheduling threads to process the data soon enough, and making too many copies of audio data which delays the input as well as output. On Windows, the latency could be as high as 100ms. On Linux with preemptible and low-latency kernel, latency lower than 1ms is achievable. (These numbers are pulled from memory, and if any reader cares to point me to some latency measurement results—more recent the better—that would be very welcome).

But beyond the operating system, computer audio latency is also limited by the sampling rate as well. For example, when recording stereo audio at 44100Hz with 16-bit samples, filling a 1024 byte buffer takes 6ms. A buffer size for sub-millisecond latency would need to be 128 bytes, which is about the size of a memory cache line. This translates to about 32 samples. A very small number for audio effects processing. For example, computing Fourier Transform on a block size this small is not very interesting. This means that in order to achieve zero-latency computer audio effect, the algorithm has to be designed as a stream of samples.

Here I'm just noting two algorithms that are streamable: low-pass and high-pass filters. Both algorithms have a “discrete-time realization” (taken from Wikipedia) as follows:

  • Low-pass filter: \[ y_i = \alpha x_i + (1 - \alpha) y_{i-1} \qquad \text{where} \qquad \alpha \triangleq \frac{\Delta_T}{RC + \Delta_T} \]
  • High-pass filter: \[ y_i = \alpha y_{i-1} + \alpha (x_{i} - x_{i-1}) \qquad \text{where} \qquad \alpha \triangleq \frac{RC}{RC + \Delta_T} \]
In both cases, the cutoff frequency is \( f_c = \frac{1}{2\pi RC} \), and \( \Delta_T \) is the duration of time between samples (the reciprocal of the sampling rate).

Saturday, April 14, 2012

A proposal for remote procedure call IDL

This evening I decided to give some thought about a good RPC design, with a simple semantics and efficient wire format. There is a lot of prior art, e.g. SUN RPC, SMB, CORBA, SOAP, XMLRPC, and I don't want to go into the detail of any one of them. The XML based wire format is not efficient. The ASN.1 or XDR based wire format are fine, but JSON is simpler. It also has an efficient wire format called MessagePack. However, the RPC specification that came with MessagePack isn't satisfactory. One prominent missing feature is the streaming RPC. Its IDL is also tied to the underlying data representation. What I'm looking for is a simply typed way to describe a JSON value. The overall RPC semantics should be like simply-typed JavaScript.

Here is an informal proposal for the type of a simply-typed JSON value.

key-type := number | boolean | string | enum-type
key := number-value | boolean-value | string-value | enum-value
opt-key := key | optional key

type := key-type | object
      | [type]   // array
      | (type1, type2, ..., typen)   // tuple
      |    // associative map
        {opt-key1: type1, opt-key2: type2, ..., opt-keyn: typen}
      | nullable type

The type describes a JSON object. It can be:
  • One of the primitive types "number," "boolean," or "string," or it could be simply "object" for any JSON object without more refined type description.
  • The array syntax is for describing an array where all items are of the same type.
  • The tuple is for describing a sequence of objects of various types, even though in JSON it would be represented as an array as well.
  • The associative map is a key-value mapping from a concrete value of the key to a type. This is the refined description of a complex JSON object. The concrete key values can be a number, a boolean, or a string. The key values can be optionally prefixed by "optional" to indicate that the key could be missing.
  • A type with the "nullable" qualifier which indicates that the value could be null.
Furthermore, for the ease of defining constants or symbolic names of associative map, we allow the user to define enums which map a symbolic name to a number. The enum type's name could be used wherever a type is expected, and the enum symbols could be used whenever a key is expected.

An interface is a collection of function calls with the signature:

typearg -> typeret throws type1, type2, ..., typen

where typearg is the type of the function argument, typeret is the type of the return value, and the list of types after "throws" are the exception types that could be raised by the function.

What is unique about this proposal is the unified treatment of bidirectional and streaming RPC. Typically, streaming RPC is a way for the server to send data in multiple responses, asynchronously back to the client. Client can always stream data to the server by making a call to server as many times as necessary. Here the streaming RPC is generalized to the concept of a co-interface. Whenever server needs to make a part of the response available, it would invoke one of the functions in the co-interface. A co-interface is supposed to be implemented by the client for receiving calls from the server, in order to connect to the server. Hence, the complete service description consists of a server-side interface and a client-side co-interface, which is how bidirectional RPC can be made.

Update: June 1, 2012. I've decided to get rid of exception (throw) and instead use cointerface for signaling alternative continuation.

Sunday, March 18, 2012

The real problem with search engine optimization... or the web?

This evening, I read a news that Google is Planning to Penalize Overly Optimized Sites, and I found it through Slashdot. What really interests me is that the way I found the news is an illustration of a search engine optimization problem itself. It turns out the real "beef" of the news behind the links is 5 degrees to the secondary source and 2 more degrees to the primary source. Each time the story is linked, another person either adds a little bit more insight or just summarizes the linked successor slightly differently without much value added.

Here is the link structure of the story.

The distinction between primary source and secondary source is made that secondary source creates new insight from the primary source, but the primary source is just a recording of facts without any insight. In this case, the primary source included an audio recording of a panel discussion, and the secondary source highlighted pieces of it with its own interpretation.

Note that CNET also posts the link to the primary source on, but without context that it is the primary source.

I think this particular example of the search engine optimization story illustrates the problem very well. In primary school, you were taught that every reference is either a primary source or a secondary source, but the reality is always more complicated. Some secondary sources are primary sources in some aspect, and tertiary sources can still add value (or you can argue that the third category is also a secondary source). The web makes it much more easier to link, and now people also have financial incentive---the more they get cited, the more ads they can show, and that gives them a profit. They can do that without generating any original content, or they can add a little value by adding their own interpretation. The ads revenue they generate, on the other hand, depends on how much effort they put in promoting their post, so that other people will want to link to them.

Let's put the financial motive aside and assume that people will always link without ads revenue. If a person can get away showing tons of ads but actually adds tons of value to some topic, then why penalize him?

The issue is that getting to the "beef" of the story is still a graph search problem. The search engine (e.g. Google Bot) supposedly has all the link graph information, but it does not understand the distinction between primary and secondary sources, and it's up to the reader to investigate by spending time and effort. With the amount of information exploding each day, we really need a search engine that will do better. If it's too much to ask a search engine to understand the difference between primary and secondary sources, I think it is at least plausible to have a tool to highlight the contribution of each page with regard to a particular aspect or topic---like a "diff" tool with fuzzy syntactic matching.

Friday, March 9, 2012

The Tower Hotel Problem

Dynamic storage allocation problem, both as an online problem and an offline problem, is widely studied in literature, such as this article by Jordan Gergov. Here I present a similar problem which I call "The Tower Hotel Problem."

The Tower Hotel is peculiar. It has an infinite number of floors, but each floor has the same fixed number of rooms. The proprietor wanted to keep the number of floors open to the public minimal in order to save costs on maintenance; a closed floor incurs no cost. A floor is closed if none of the rooms are occupied. But as long as there is at least one occupied room, the floor must be open to public. Once a guest arrives and settles in a room, the proprietor must not ask the guest to move to another room.

In a zealous attempt to minimize cost, the proprietor requires all guests to make reservations before the year begins. The guest would indicate upon reservation the arrival and departure dates. Once the year begins, no more reservation can be made. The proprietor then optimizes the placement of guests on floors. This eccentric policy may be one reason the hotel is not very well known.

A number of years passed, and the proprietor passed away. His son knows nothing about hotel management, so the son interviewed three potential managers. In order to make the hotel more popular, the son required that the managers must not assume the reservations are made in advance. In fact, he went a step further and said that a guest could arrive at any time and leave at any time.

Each manager had a different strategy. The first manager would place the guest at random on any of the opened floors with vacancy (and open up a new floor if none of the existing floors are available). The second manager would place the guest on the floor with the most number of occupied rooms (but still has vacancy). The third manager would place the guest on the lowest numbered floor with vacancy. The third manager has an assistant who, due to her desire to be promoted, secretly proposes to the proprietor's son a slightly different strategy than her manager: she would locate the lowest numbered floors with vacancy, fill it entirely first, then find again the lowest numbered floors with vacancy, which may be lower than the previously chosen floor because some guests below may have checked out.

How does the proprietor's son determine which manager has the most cost-effective strategy? How do these managers and the third manager's assistant compare with what the proprietor could do, when the proprietor had full knowledge of all the reservations in the whole year?

This problem describes segregate fits dynamic storage allocation with virtual memory where all objects are of the same size, and a memory page could fit a fixed number of these objects. A virtual memory page has to be backed by a physical memory page if there is at least one object in the page. Once all objects in the page are freed, the physical memory page backing can be removed. This presents a different sort of fragmentation problem where certain long-lived objects can pin a memory page even though the page is mostly unused.