Friday, April 30, 2010

Are fast integer min and max operators portable?

Sometimes when writing high performance code, hackers have come to use hard coded assembly to compute integer min(x, y) and max(x, y) in a way described in The Aggregate MAGIC. The formula for integer min is x + (((y - x) >> (WORDBITS - 1)) & (y - x)). Of course, the common expression (y - x) can be computed just once. The constant (WORDBITS - 1) can be computed at compile time. This expression requires 4 instructions. This code is reasonably portable, as most computer architectures nowadays are two's complement, and they support addition, subtraction, and sign-preserving bit shifts.

Compared with the usual min(x, y) defined as:
int min(int x, int y) {
  int result;
  if (x < y)
    result = x;
  else
    result = y;
  return result
}
which is essentially what the expression (x < y)? x : y gets compiled to at the assembly level, the idea is that the "magic" version does not use branching, so it will work best with processors with pipelined instruction execution. The branching would cause the pipeline to be flushed. A naive processor with no branch prediction and speculative execution will suffer a heavy penalty.

On modern CPU, a lot of transistor die is dedicated to branch prediction and speculative execution. One chief reason is that it alleviates the burden on the programmer, so the programmer can continue to write simple code. It also alleviates the burden on compiler writer, so that the compiler does not need the heavy machinery to recognize pattern in code like (x < y)? x : y and substitute it for the branch-free version. However, the transistors doing branch prediction and speculative execution draw a lot of electric power. They also produce a lot of heat, requiring more power in a data center for heat management. It is becoming unfashionable in a "green conscious" economy.

Here comes the ARM processor that powers mobile phones and embedded devices. The reason it is used for low-power applications is because of its energy efficiency. Words are that ARM will enter the data center market in 2011 allowing data centers to reduce its energy use. In terms of technical merits as a server, ARM began to support atomic compare and swap instruction, a key ingredient in coordinating multi-threaded programs, in ARMv6 instruction set. In terms of hardware implementation, ARM11 MPCORE and ARM Cortex-A9 are multi-core processors.

One design choice that leads to the low-power requirement of ARM processor is its instruction set. Instead of having branch prediction and speculative execution, each instruction is conditionally executed. It allows a branch-free rendition of the min and max operator essentially like this:
int min(int x, int y) {
  bool cond = (x <= y);
  int result;
  cond && (result = x);
  !cond && (result = y);
  return result;
}
Hence, the expression (x < y)? x : y can be compiled to 3 instructions.

This presents an interesting dilemma to the programmer.
  • If he were to program the "magic" version, his code will run faster on all pipelined architectures, but it may run slightly slower on ARM.
  • If he were to use the standard (x < y)? x : y expression, the compiler may or may not optimize this code. He may not get consistent performance when using a different compiler.
Furthermore, it presents an interesting challenge to compiler writers. When writing a multi-target compiler, the source code is often compiled to a platform-neutral intermediate code by a front-end, and then from the intermediate code to the target machine code by the back-end. The intermediate code may have scrambled the source code enough so that it is hard for the back-end to recognize simple patterns like (x < y)? x : y and perform pin-hole optimization.

Finally, the programmer who wants the performance guarantee would probably just write in-line assembly for each target machine architecture. This is just another fine example where, if you want to maximize the ability of hardware, you still need to write code for that particular architecture.

No comments: