Life of a Computer Scientist: November 2012

Thursday, November 29, 2012

MISC: binary Galois Field multiplication

After reading NIST may not have you in mind, I decided that a new ISA should support binary Galois Field multiplication. Binary Galois Field is a special case of binary polynomial \( \mathbb{Z}_2[x] \) where addition and subtraction are already supported as XOR, written as \( \oplus \). What is missing for GF is multiplication and division in \( \mathbb{Z}_2[x] \). The binary polynomials are expressed as bit vectors where the \(n\)-th bit is the coefficient of the \(x^n\) term. Let \( \otimes \) be polynomial multiplication.

pmul: \( (a, b) \to (r, s) \) where \( r s = a \otimes b \), i.e. \( r \) is the upper word of the result, and \( s \) is the lower word.
pdiv.mod: \( (a, b) \to (q, r) \) where \( q \otimes b = a \oplus r \), i.e. \( q \) is the quotient and \( r \) is the remainder.

It's been a while since I last studied abstract algebra, so I might not have gotten all my terminology correct, but you get the idea.

Friday, November 23, 2012

MISC: a high-level comparison to TTA

There is this old saying that there is nothing new under the sun. While I thought I invented something new, it came to my attention the Transport Triggered Architecture (TTA) which I did not know before. I found some papers as early as 1994, by Henk Corporaal.

Both MISC and TTA specify an instruction set architecture where each ALU has associated input and output ports, and writing to the input ports trigger a computation.

MISC falls short describing how the instruction set would actually be implemented in hardware. It assumes that there is a hardware scheduler that can detect when the output port is ready and advances the pipeline. The compiler only needs to worry about transforming the program into a set of pipelines. This abstracts out the timing details such as gate length out of the instruction set architecture.

TTA programs consist of cycle-stepped move instructions that has to take gate length into account, due to the lack of a global control logic. If an operation requires two inputs—one is called the operand and the other is called a tigger—the operand must be stored before or during the same cycle as the trigger. If the program moves a value out of the result too early, it would read out a corrupt value. This means that the compiler has to take gate length into account while scheduling instructions. If a function unit takes too long (e.g. cache misses) to complete an operation, it must lock the data transport bus so additional move would not take place.

In contrast, MISC exposes no cycle-sensitive timing in the pipeline. The monadic design of the functional units allows pipelines to synchronize by data dependency graph, not by timing. Hence there is no need to globally lock the pipeline. I mentioned one instruction to wait for all moves in the instruction queue to finish, but I only added there for the purpose of context switching. I still need to elaborate more on how I plan to support preemptive multitasking.

TTA already has a hardware implementation. It is not clear if the idealized scheduler for MISC is implementable in hardware, or if it turns out to be the performance bottleneck.

Saturday, November 10, 2012

MISC: Computation Modules

This is a continuation of Microcosmic Instruction Set Computer, defining the computation modules before I decide on the port assignments. The notation is \( (I_1, I_2, ..., I_m) \to (O_1, O_2, ..., O_n) \) for input ports \( I_1 ... I_m \) and output ports \( O_1 ... O_n \). The total number of input and output ports for each operator are kept as power of 2 to allow less fragmentation in port assignment later.

Arithmetic operators

add: \( (a, b) \to (r, c) \) where \( r = a + b \) and \( c \) is the carry bit.
sub: \( (a, b) \to (r, c) \) where \( r = a - b \) and \( c \) is the carry bit.
neg: \( a \to b \) where \( b = -a \).
mul: \( (a, b) \to (r, s) \) where \( r s = a \times b \), i.e. \( r \) is the upper word of the result, and \( s \) is the lower word.
div.mod: \( (a, b) \to (q, r) \) where \( q \times b = (a - r) \), i.e. \( q \) is the quotient and \( r \) is the remainder.
smul, sdiv.mod: same as mul and div.mod but for signed integers.
and: \( (a, b) \to (c, c') \) where \( c = a \wedge b \) and \( c' = a \wedge \neg b \), both bitwise.
or.xor: \( (a, b) \to (c, c') \) where \( c = a \vee b \) and \( c' = a \oplus b \), both bitwise.
rot.left: \( (a, c) \to (r, s) \) where \( r \) is the result of shifting \( a \) to the left by \( c \) bits, and \( s \) is the overflow part of \( a \) adjacent to the least significant bit.
rot.right: \( (a, c) \to (r, s) \) where \( r \) is the result of shifting \( a \) to the right by \( c \) bits, and \( s \) is the underflow part of \( a \) adjacent to the most significant bit.
not: \( a \to b \) where \( b = \neg a \), bitwise.

Signed integers use two's complement representation, so they don't need to distinguish addition and subtraction. Note that the conditional branch instruction should be able to use the condition from the least significant bit which is the carry bit, as well as the most significant bit which is the signed bit. This allows comparing both signed and unsigned integers.

Convenience operators

lea: \( (b, a, x) \to r \) computes \( r = 2^a \cdot x + b \).
add3: \( (a, b, c) \to r \) computes \( r = a + b + c \).
or3: \( (a, b, c) \to r \) computes \( r = a \vee b \vee c \).
xor3: \( (a, b, c) \to r \) computes \( r = a \oplus b \oplus c \).

Any of these operators—especially “or” or “xor”—could be used for the purpose of joining the completion of all computations. These are not to be confused with non-deterministic join. A non-deterministic join is not necessary; just have two pipes connecting from some two different output ports to the same input port.

Conditional operators

cond: \( (c, a, b) \to r \) where \( r = c \mathop{?} a : b \), i.e. returns \( a \) if \( c \) is nonzero, otherwise \( b \).
cx: \( x \to c \) where \( c \) extends the carry bit of \( x \) to fill the whole word.
sx: \( x \to s \) where \( s \) extends the signed bit of \( x \) to fill the whole word.

Memory controller

read: \( a \to r \) where \( r \) is the value of memory word at address \( a \) which must be word-aligned.
write: \( (a, x, z) \to z' \) to write \( x \) to the memory word at address \( a \) which must be word-aligned, whenever \( z \) is also ready. The value of \( z \) is not used. When the write finishes, \( z' \) becomes ready with a zero value.
cswap: \( (a, new, old) \to old' \) writes \( new \) to the memory word at address \( a \) only when \( a \) still contains the expected \( old \) value. The actual value of \( a \) before the swap is returned as \( old' \).

Byte packing

pack: \( (a, b, p) \to r \) where \( r \) is the result of permuting the concatenation of \( a b \) according to mapping \( p \).
unpack: \( (a, p) \to (r, s) \) where \( r \) and \( s \) are the two words selected from \( a \) by mapping \( p \) and they can be optionally sign extended.

Coprocessor I/O

exec: \( (k, a_1, a_2) \to r \) sends a two-argument command to co-processor \( k \) and returns its result in \( r \).
exec2: \( (k, a_1, a_2, a_3, a_4, a_5) \to (r_1, r_2) \) sends a five-argument command to co-processor \( k \) and returns a pair of results in \( r_1 \) and \( r_2 \).

This can be used to drive the floating point coprocessor as well. The floating point coprocessor, for example, would occupy a range of k in the coprocessor namespace, one for each operation.

Some concluding remark

The fact that all computation modules above have non-empty inputs and outputs means that this forms a monadic set of operators, where synchronization is accomplished by port pipelining. I guess you can call this Monadic Instruction Set Computer also.

Because of the fine granularity of some of the computation modules, it makes sense to base the cycle on the finest granule which is the gate distance, e.g. the bitwise operators are all single gate distance, hence single cycle. This means that while some modules are single cycle, others can be hundreds or thousands of cycles.

Friday, November 9, 2012

Microcosmic Instruction Set Computer

When it comes to instruction set architecture, there are many philosophies, ranging from CISC to RISC down to extremes like OISC or ZISC. The dominent is still Intel x86 or x86-64 which is CISC, but ARM is getting popular too which is RISC. I've not seen any commercial product based on OISC or ZISC so they are probably not practical.

Having had some experience in Internet planetary scale distributed computing where remote procedure calls are made between computer services, coming back to looking at instruction set design gives me the revelation that even a single-core microprocessor is itself a distributed system. Rather than a distributed system of disk storage, memcache, and web servers, the microprocessor is a distributed system of arithmetic logic units, memory controller, and I/O buses. This gives rise to the idea of a microcosmic instruction set computer where the instruction set takes care of basic register file and control flow, but offloads all computing and memory I/O activities to the microcosm of distributed services on the processor die.

The instruction set features:

Some (yet-to-be specified) instructions for unconditional and conditional branching, namely to affect the instruction fetching.
A small I/O address space (~16K?) each specifies a word-sized port, which are buffered memory that can be written to and read from. Each port also has a “ready” bit to signal the availability of data, for the purpose of instruction scheduling.
Lower tiered I/O ports (0-15) are simply buffered general purpose register file. Middle tier ports (16-255) are for multiple ALUs, memory controllers, and external I/O controllers. Upper tier ports (256-?) are laid out in groups of 256 like the lower and middle tier ports (0-255) to enable additional instruction level parallelism.
An instruction specifies a move from one source port to a destination port. The instruction is only executed when data for that port is ready, but multiple instructions can be queued by the instruction scheduler. The move instruction can be seen as “connecting” a pair of ports.
An instruction to write a small constant value directly to a port.
One instruction to wait for all moves in the instruction queue to finish.

All ports can be used like a register (i.e. they are buffered), but some ports are used for inputs and some are used for outputs. For example, an adder for \( i + j = k \) would occupy three ports, \( p_i \), \( p_j \), and \( p_k \). The adder begins working whenever the ready bits of \( p_i \) and \( p_j \) are set, and the ready bit of \( p_k \) is only asserted when the result is available. The adder can be powered off when it's not doing work.

The instruction scheduler could dynamically map port numbers in groups of 256 if it wishes to turn off additional die area to reduce power use even further.

Even within a single group, ports of the same functionality may indeed be a queue to a smaller number of units. For example, the port assignment might give \( 8 \times 3 = 24 \) ports to an adder, but a particular chip might only have 2 physical adders doing the work of 8 logical ones. It is particularly useful to have multiple logical units of memory controller to allow memory I/O to be queued.

To be continued...