Two constructs in a program cause branches that need prediction. Conditional branches come from if-statements, where the code path takes the then-clause if the condition is true, or the else-clause otherwise. Indirect branches come from function pointers (used by callbacks), virtual method tables (used by object-oriented languages), and switch-statements. Unconditional direct branches do not need to be predicted because their target is fixed.
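To make the two kinds concrete, here is a minimal C sketch; the names on_even, on_odd, and dispatch are made up for illustration. The if-statement compiles to a conditional branch, and the call through the function pointer compiles to an indirect branch.

#include <stdio.h>

typedef void (*callback_t)(int);   /* type of the indirect branch target */

static void on_even(int x) { printf("%d is even\n", x); }
static void on_odd(int x)  { printf("%d is odd\n", x); }

void dispatch(int x) {
    callback_t cb;
    if (x % 2 == 0)    /* conditional branch: then-clause vs. else-clause */
        cb = on_even;
    else
        cb = on_odd;
    cb(x);             /* indirect branch: target known only at run time */
}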
Branch prediction has been well studied: how various CPU microarchitectures implement it, and where its limits lie. Since prediction misses are expensive, these studies measure the effectiveness of branch prediction by the percentage of hits. But I argue that branch prediction might not be the best way to solve the pipeline stall problem. The better idea is to make branches predictable.
Predictable Indirect Branches
Let's say the program is about to call a virtual method of an object. The first word of the object is a pointer to the virtual table, and the method pointer is the word at offset 7 in the virtual table. The object pointer is stored in register %r0, and the stack pointer is in register %sp. For brevity, pointer arithmetic is expressed in units of a machine word.
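Before the assembly, here is a rough C-level sketch of such a call; the struct layout, the two-int method signature, and the name call_method are assumptions chosen to match the description above, not any real ABI.

struct object {
    /* first word: pointer to the virtual table (an array of method pointers) */
    void (**vtable)(struct object *, int, int);
    /* ... other fields ... */
};

void call_method(struct object *obj, int a, int b) {
    /* fetch the vtable, pick the method at word offset 7,
       then call it with the object pointer as the zeroth argument */
    void (*method)(struct object *, int, int) = obj->vtable[7];
    method(obj, a, b);
}

The assembly pseudo-code below spells out the same call at the instruction level.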
; zeroth argument to the method is the object pointer.
store %r0 -> %sp + 0
; compute the arguments and store them.
...
store %r1 -> %sp + 1
...
store %r2 -> %sp + 2
; prepare to call the method.
load %r0 -> %r3       ; load address of the virtual table
load %r3 + 7 -> %r4   ; load address of the method
call %r4

The pipeline stalls because the call instruction does not know what is in %r4 until the instruction that loads the method address into %r4 has executed. I included the code that computes the function call arguments to emphasize that there is a lot of work the CPU could be doing before the real branch target is computed. We can avoid the pipeline stall if we just reorder the work slightly.
; prepare to call the method.
load %r0 -> %r3       ; load address of the virtual table
load %r3 + 7 -> %r4   ; load address of the method
setcall %r4           ; set the branch target but do not branch yet.
; zeroth argument to the method is the object pointer.
store %r0 -> %sp + 0
; compute the arguments and store them.
...
store %r1 -> %sp + 1
...
store %r2 -> %sp + 2
call                  ; do the actual call.

In this latter example, we compute the branch target first and do the other work later. By the time the pipeline needs to decode the next instruction after the branch, setcall should have been executed and provided us with the actual target. If not, it is okay for the pipeline to stall for a bit. Programs that do not use setcall/call can still benefit from the branch prediction buffer.

Predictable Conditional Branches
Conditional branches are a bit trickier because we can't generally precompute the condition. One technique available to the programmer is to compute both branches and mask out one of the two results. The mask can be created with a bit manipulation trick; a possible to_mask is sketched after the code. For example, the statement:

int result = condition ? expression_true : expression_false;

can be decomposed as:

int result_true = expression_true;
int result_false = expression_false;
int mask = to_mask(condition);
int result = (mask & result_true) | (~mask & result_false);

This approach only works if the expressions are trivial to compute. Otherwise the redundant work would be more expensive than the pipeline stall.
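The snippet above leaves to_mask undefined; here is one possible sketch in C, relying on two's complement so that negating 1 produces an all-ones word.

/* Turn any nonzero condition into an all-ones mask (-1 in two's
   complement) and a zero condition into an all-zeros mask. */
static inline int to_mask(int condition) {
    return -(condition != 0);
}

With such a mask, (mask & result_true) | (~mask & result_false) selects one of the two precomputed results without any conditional branch.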
A more general approach recognizes that conditional branches are often asymmetric: one branch is the fast path in a tight loop, and the other branch is the slow path. Take for example an implementation of the strlen(s) function.

size_t strlen(const char *s) {
  size_t n = 0;
  while (*s)
    ++n, ++s;
  return n;
}

Its assembly pseudo-code might be:
strlen:
    load %sp -> %r1       ; load s into %r1
    xor %r0, %r0 -> %r0   ; zero %r0 as n
loop:
    loadbyte %r1 -> %r2   ; dereference *s
    bz %r2, endloop       ; branch if zero
    inc %r0
    inc %r1
    b loop                ; unconditional, no prediction needed.
endloop:
    r                     ; return value in %r0

The basic block between loop and endloop is the tight loop. One way to argue that it must run as quickly as possible is that it runs many times; that is the classic branch prediction argument, which seeks to maximize the probability of prediction hits.

But we can also argue that branches should always favor the tight loop because taking the bz branch leads to more expensive code (the function return), whereas the cost of not taking the branch is cheap. This is true even if s is an empty string. In other words, we make the branch predictable by analyzing whether the cost of a prediction miss can be amortized into the code that follows.

The consequence of misprediction can be visualized as the following table:
     | Fast Path     | Slow Path      |
Hit  | For The Win!  | Not a big win. |
Miss | Huge penalty! | Small penalty. |
There are two ways to distinguish the fast and slow paths. The compiler can insert branch hints, which should work well because the cost of the code on each path is known statically. The CPU can also decode ahead and compute the path cost of both branches. Where it is hard to tell the fast path from the slow path, the CPU can always fall back to the usual branch prediction heuristics based on historical data.
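For comparison, compilers such as GCC and Clang already let the programmer supply a coarse hint through __builtin_expect. Here is a minimal sketch; the likely/unlikely macro names follow a common convention, and hinted_strlen is just an illustrative name. Note that the builtin hints the likely direction of the branch rather than the relative path cost discussed above.

#include <stddef.h>

/* Common convention: wrap __builtin_expect in likely()/unlikely() macros. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

size_t hinted_strlen(const char *s) {
    size_t n = 0;
    /* Hint that the loop condition is usually true, i.e. staying in the
       tight loop is the fast path and falling out is the slow path. */
    while (likely(*s)) {
        ++n;
        ++s;
    }
    return n;
}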