Tuesday, November 10, 2015

From Branch Prediction to Predictable Branching

Branch prediction is a technique in multi-stage pipelined CPU design to avoid pipeline stall. The reason of the stall is that early stages of the pipeline such as instruction fetch and decode cannot happen unless the CPU knows where to fetch the next instruction, but it has to wait for the result from earlier instructions to know which branch to take. Branch prediction speculatively executes one of the branch. If the prediction turns out to be good, the results of the execution is committed, otherwise the pipeline is flushed, and execution restarts from the actual branch.

There are two constructs in the program that cause branching. Conditional branches come from if-statements, where the code path takes the then-clause if the condition is true, or the else-clause otherwise. Indirect branches come from function pointers (used by callbacks), virtual method tables (used by object-oriented languages), and switch-statements. Unconditional direct branch does not need to be predicted.

Branch prediction has been well-studied, such as how various CPU microarchitectures implement branch prediction, and the limits of branch prediction. Since prediction misses are expensive, they measure the effectiveness of branch prediction by the percentage of hits. But I argue that branch prediction might not be the best approach to solve the pipeline stall problem. The better idea is to make branches predictable.

Predictable Indirect Branches

Let's say the program is about to call a virtual method of an object. The first word of the object is a pointer to the virtual table. The method pointer is the 7th word in the virtual table. The object pointer is stored in register %r0, and the stack pointer is in register %sp. For brevity, pointer arithmetic is expressed in the unit of a machine word.
; zeroth argument to the method is the object pointer.
store %r0 -> %sp + 0

; compute the arguments and store them.
...
store %r1 -> %sp + 1
...
store %r2 -> %sp + 2

; prepare to call the method.
load %r0 -> %r3      ; load address of the virtual table
load %r3 + 7 -> %r4  ; load address of the method
call %r4
The reason of the pipeline stall is because the call instruction doesn't know what is in %r4 until the instruction to load method address into %r4 is executed.

I included the code to compute the function call arguments to emphasize the fact that there is a lot of work that the CPU could be doing before the real branch target is computed. We can avoid the pipeline stall if we just reorder the work slightly.
; prepare to call the method.
load %r0 -> %r3      ; load address of the virtual table
load %r3 + 7 -> %r4  ; load address of the method
setcall %r4  ; set the branch target but do not branch yet.

; zeroth argument to the method is the object pointer.
store %r0 -> %sp + 0

; compute the arguments and store them.
...
store %r1 -> %sp + 1
...
store %r2 -> %sp + 2

call  ; do the actual call.
In this latter example, we compute the branch target first and do the other work later. By the time the pipeline needs to decode the next instruction after the branch, setcall should have been executed and provided us with the actual target. If not, it is okay for the pipeline to stall for a bit. Programs that do not use setcall/call can still benefit from the branch prediction buffer.

Predictable Conditional Branches

Conditional branches are a bit trickier because we can't generally precompute the condition. One technique used by the programmer is to compute both branches and mask the result of either one. The mask can be created using a bit manipulation technique. For example, the statement:
int result = condition? expression_true : expression_false;
Can be decomposed as:
int result_true = expression_true;
int result_false = expression_false;
int mask = to_mask(condition);
int result = (mask & result_true) | (~mask & result_false);
This approach only works if the expressions are trivial to compute. Otherwise the redundant work would be more expensive than pipeline stall.

A more general approach recognizes the fact that conditional branches are often asymmetric: one branch is the fast path in a tight loop, and the other branch is the slow path. Take for example an implementation of the strlen(s) function.
size_t strlen(const char *s) {
  size_t n = 0;
  while (*s)
    ++n, ++s;
  return n;
}
Its assembly pseudo-code might be:
strlen:
  load %sp -> %r1  ; load s into %r1
  xor %r0, %r0 -> %r0  ; zero %r0 as n

loop:
  loadbyte %r1 -> %r2  ; dereference *s
  bz %r2, endloop  ; branch if zero
  inc %r0
  inc %r1
  b loop  ; unconditional, no prediction needed.

endloop:
  r  ; return value in %r0
The basic block between loop and endloop is the tight loop. One way to argue that it must run as quickly as possible is because it runs many times; that's the classic branch prediction argument that sought to maximize the probability of prediction hits.

But we can also argue that branches should always favor the tight loop because the consequence of taking the bz branch result in more expensive code (function return), whereas the cost of not taking the branch is cheap. This is true even if s is an empty string. In other words, we are making the branch predictable by analyzing whether the cost of prediction miss can be amortized into the code that follows.

The consequence of misprediction can be visualized as the following table:

Fast PathSlow Path
HitFor The Win!Not a big win.
MissHuge penalty!Small penalty.

There are two ways to distinguish fast and slow paths. The compiler can insert branch hints, which should work well because branch misprediction cost is static. The CPU can also decode ahead and compute the path cost of both branches. In the case where it is difficult to tell between the fast and slow paths, the CPU can always use the branch prediction heuristic based on historical data.

Conclusion

Here I presented methods to make branching predictable. For the indirect branch, the idea is to compute the branch target as early as we can before the actual branch, and there are usually more work that could be done in the mean time. For the conditional branch, the idea is to distinguish the fast and slow paths, and prefer to branch to the fast path because the slow path is penalized less by miss cost. Both methods can be used in addition to the branch prediction buffer based on historical data.