Having had some experience in Internet planetary scale distributed computing where remote procedure calls are made between computer services, coming back to looking at instruction set design gives me the revelation that even a single-core microprocessor is itself a distributed system. Rather than a distributed system of disk storage, memcache, and web servers, the microprocessor is a distributed system of arithmetic logic units, memory controller, and I/O buses. This gives rise to the idea of a microcosmic instruction set computer where the instruction set takes care of basic register file and control flow, but offloads all computing and memory I/O activities to the microcosm of distributed services on the processor die.
The instruction set features:
- Some (yet-to-be specified) instructions for unconditional and conditional branching, namely to affect the instruction fetching.
- A small I/O address space (~16K?) each specifies a word-sized port, which are buffered memory that can be written to and read from. Each port also has a “ready” bit to signal the availability of data, for the purpose of instruction scheduling.
- Lower tiered I/O ports (0-15) are simply buffered general purpose register file. Middle tier ports (16-255) are for multiple ALUs, memory controllers, and external I/O controllers. Upper tier ports (256-?) are laid out in groups of 256 like the lower and middle tier ports (0-255) to enable additional instruction level parallelism.
- An instruction specifies a move from one source port to a destination port. The instruction is only executed when data for that port is ready, but multiple instructions can be queued by the instruction scheduler. The move instruction can be seen as “connecting” a pair of ports.
- An instruction to write a small constant value directly to a port.
- One instruction to wait for all moves in the instruction queue to finish.
The instruction scheduler could dynamically map port numbers in groups of 256 if it wishes to turn off additional die area to reduce power use even further.
Even within a single group, ports of the same functionality may indeed be a queue to a smaller number of units. For example, the port assignment might give \( 8 \times 3 = 24 \) ports to an adder, but a particular chip might only have 2 physical adders doing the work of 8 logical ones. It is particularly useful to have multiple logical units of memory controller to allow memory I/O to be queued.
To be continued...