Friday, April 30, 2010

Are fast integer min and max operators portable?

Sometimes when writing high performance code, hackers have come to use hard coded assembly to compute integer min(x, y) and max(x, y) in a way described in The Aggregate MAGIC. The formula for integer min is x + (((y - x) >> (WORDBITS - 1)) & (y - x)). Of course, the common expression (y - x) can be computed just once. The constant (WORDBITS - 1) can be computed at compile time. This expression requires 4 instructions. This code is reasonably portable, as most computer architectures nowadays are two's complement, and they support addition, subtraction, and sign-preserving bit shifts.

Compared with the usual min(x, y) defined as:
int min(int x, int y) {
  int result;
  if (x < y)
    result = x;
    result = y;
  return result
which is essentially what the expression (x < y)? x : y gets compiled to at the assembly level, the idea is that the "magic" version does not use branching, so it will work best with processors with pipelined instruction execution. The branching would cause the pipeline to be flushed. A naive processor with no branch prediction and speculative execution will suffer a heavy penalty.

On modern CPU, a lot of transistor die is dedicated to branch prediction and speculative execution. One chief reason is that it alleviates the burden on the programmer, so the programmer can continue to write simple code. It also alleviates the burden on compiler writer, so that the compiler does not need the heavy machinery to recognize pattern in code like (x < y)? x : y and substitute it for the branch-free version. However, the transistors doing branch prediction and speculative execution draw a lot of electric power. They also produce a lot of heat, requiring more power in a data center for heat management. It is becoming unfashionable in a "green conscious" economy.

Here comes the ARM processor that powers mobile phones and embedded devices. The reason it is used for low-power applications is because of its energy efficiency. Words are that ARM will enter the data center market in 2011 allowing data centers to reduce its energy use. In terms of technical merits as a server, ARM began to support atomic compare and swap instruction, a key ingredient in coordinating multi-threaded programs, in ARMv6 instruction set. In terms of hardware implementation, ARM11 MPCORE and ARM Cortex-A9 are multi-core processors.

One design choice that leads to the low-power requirement of ARM processor is its instruction set. Instead of having branch prediction and speculative execution, each instruction is conditionally executed. It allows a branch-free rendition of the min and max operator essentially like this:
int min(int x, int y) {
  bool cond = (x <= y);
  int result;
  cond && (result = x);
  !cond && (result = y);
  return result;
Hence, the expression (x < y)? x : y can be compiled to 3 instructions.

This presents an interesting dilemma to the programmer.
  • If he were to program the "magic" version, his code will run faster on all pipelined architectures, but it may run slightly slower on ARM.
  • If he were to use the standard (x < y)? x : y expression, the compiler may or may not optimize this code. He may not get consistent performance when using a different compiler.
Furthermore, it presents an interesting challenge to compiler writers. When writing a multi-target compiler, the source code is often compiled to a platform-neutral intermediate code by a front-end, and then from the intermediate code to the target machine code by the back-end. The intermediate code may have scrambled the source code enough so that it is hard for the back-end to recognize simple patterns like (x < y)? x : y and perform pin-hole optimization.

Finally, the programmer who wants the performance guarantee would probably just write in-line assembly for each target machine architecture. This is just another fine example where, if you want to maximize the ability of hardware, you still need to write code for that particular architecture.

Tuesday, April 27, 2010

C++ void template parameter

In C++, void is special. It is an incomplete type, so sizeof(void) is unknown. To maintain compatibility with C, it gives special consideration to void as the sole function argument type if the argument is not assigned to a variable, so the following declarations are equivalent.
int foo(void) { return 42; }
int foo() { return 42; }
However, when void is used as template parameter, and subsequently used as function argument type, GCC rejects it. Consider this class:
template<typename Tp>
class foo {
  int operator()(Tp) { return 42; }
Instantiating foo<void> results in error: invalid parameter type ‘void’.

However, some idioms such as the identity function should work for void. Consider this definition for identity:
template<typename Tp>
class identity {
  Tp operator()(Tp x) { return x; }
Obviously, even when manually substituting Tp for void, the code would not work because variable x would have an incomplete type. However, the idiom can be made consistent again in the case of void using template specialization.
class identity<void> {
  void operator()(void) { return; }
And you would get identity<void> to behave as expected.

However, what this means is that you'd likely duplicate the definition for a lot of code. It is probably easier to declare a true empty type. Following the convention of ML, we call it unit.
struct unit {};
It can be used with identity like this:
identity<unit> id;
In C++ (unlike C), the empty struct unit has sizeof(unit) == 1, but otherwise the type should behave the same as void. And the code for identity<unit> should behave the same as identity<void>.

Thursday, April 15, 2010

OpenSolaris notes

I've been using OpenSolaris dtrace facility to study certain application behavior. Coming from the Linux world, these are some notes that help me cope with the difference, as well as to allow me to repeat this experiment in the future.
  • Package manager: pkg search keyword; pkg install pkgname
  • Packages that I may need: SUNWnetcat, gcc-dev
  • Restarting SSH: /lib/svc/method/sshd restart
  • DTrace options to minimize trace losses:
    • -x switchrate=20Hz (tell DTrace to query kernel buffer 20 times a second)
    • -b 32m (set kernel buffer size to 32MB, the max)
    • -w (ignore systematic unresponsiveness)
  • Tip on preventing drop-out
    • Use gzip to compress trace output before it is written to disk.
    • Renice dtrace subject to nice 20. If this still does not work, use a computationally intensive compression algorithm like bzip2 and pin it on the same CPU as the subject. When a lot of traces come across, bzip2 will naturally slow the subject down by taking precedence in scheduling priority.
    • Command to set CPU affinity: pbind -b cpu# pid
  • Equivalent of ps -ax on Linux: ps -eo pid,tty,s,time,args
  • Equivalent of killall on Linux: pgrep pattern; pkill pattern

Monday, April 12, 2010

Various LZMA tools

Once upon a time there was gzip, and many people used it to create .gz files (also .tar.gz and .tgz). Then bzip2 came along, and it created .bz2 files (also .tar.bz2 and .tbz2). Nowadays, most Linux distributions have switched to a new compression algorithm LZMA (Lempel-Ziv-Markov chain algorithm), but it has several implementations. Finding the right implementation has been slightly confusing to me. LZMA has been implemented by XZ Utils, 7zip, and lzip. All of which have different container format.

XZ Utils is like the traditional gzip and bzip2 stream compressor. The file extension is .xz (also .tar.xz and .txz). The XZ Utils implementation is a successor of LZMA Utils, which uses a container format now considered legacy, but XZ Utils handles this legacy format. The legacy format is specified by the suffix .lzma (also .tar.lzma).

7zip is more like the .zip format, in the sense that an archive can contain multiple compressed files. The program also handles multiple formats, .zip as well as .tar.gz and .tar.bz2. Its Unix implementation is called p7zip. Use 7zip if you see .7z (also .tar.7z).

A lesser known implementation, lzip, is implemented in the style of gzip and zlib, and produces .lz files (also .tar.lz and .tlz). This is not to be confused with .lzma files produced by LZMA Utils, as these two formats are not compatible.

According to Google search today, there are 27,000 results for "tar.xz," 23,700 results for "tar.7z," 22,200 results for "tar.lzma," and only 2,200 results for "tar.lz." GNU tar recognizes suffixes for all the formats above except .7z, and assigns -J option for .xz files.

The following is an informal benchmark run. I compiled these programs using the default Makefile's compiler optimization flags without taking note of what they are, so I might be comparing apples and oranges.
$ time zcat hugefile.gz | 7za a -si hugefile.7z

real    20m2.066s
user    31m17.788s
sys     0m14.417s
7zip seems to have implemented multi-threading, and is able to scale to 150% CPU time.
$ time zcat hugefile.gz | xz > hugefile.xz

real    38m5.059s
user    38m6.863s
sys     0m10.712s
xv promises to implement multi-threading support in the future.
$ time zcat hugefile.gz | lzip > hugefile.lz

real    43m59.658s
user    43m50.838s
sys     0m7.064s
These are the resulting file sizes.
$ ls -l
total 699264
-rw------- 1 liulk grad2 147791612 Apr 12 23:56 hugefile.7z
-rw------- 1 liulk grad2 238094571 Apr 12 22:00 hugefile.gz
-rw------- 1 liulk grad2 111189035 Apr 13 00:09 hugefile.lz
-rw------- 1 liulk grad2 112989212 Apr 13 00:03 hugefile.xz
The difference is not that great. The implementations seem to be of comparable quality, sacrificing more time in order to achieve smaller files. My personal preference would be xz at the moment, by virtue that multi-threaded support is promised, that GNU tar has a dedicated option -J for the format, and that xz comes with a suite of utilities (unxz, xzless, xzcat, etc.) analogous to gzip counterparts (gunzip, zless, zcat, etc.).

Update (May 19): I just noticed today that lzip has a parallel implementation plzip that can scale to multiple processors. The timing result is as follows:
$ time xzcat hugefile.xz | plzip > hugefile.lz

real    6m46.324s
user    40m7.743s
sys     0m19.721s
It scaled to all the available CPU on the shared computing node I tested on.

Thursday, April 8, 2010

A curious trigonometry limit

It is comforting to know that I still remember how to do high-school level pre-calculus. I was intrigued by the question "how to calculate the area of a regular convex polygon if we know the circumradius?" It turns out to be an easy exercise. Then I remembered that when the number of sides of a regular convex polygon increases to infinity, it becomes a circle. I decided to investigate this "limit" and found an interesting derivation.
\[ \lim_{x→0} {\sin{cx} \over x} = c \] which is a generalization of a well know trigonometry limit: \[ \lim_{x→0} {\sin{x} \over x} = 1 \]
The complete proof can be found in a short paper titled "A Curious Trigonometry Limit." (I'm also just beginning to learn how to use PGF and TikZ in LaTeX).

Tuesday, April 6, 2010

Blocking SSH Brute Force Attack

Stop looking for unreliable solution like fail2ban to block someone from brute-forcing the SSH login to your server. If you configure your sshd_config correctly, blocking brute-force attempt is remarkably simple. And this will not obstruct your ability to login interactively by entering a password. The method works on most systems. Simply make sure the following setting are written in /etc/sshd_config:
  • UsePAM yes
  • ChallengeResponseAuthentication yes
  • PasswordAuthentication no
When you use command line ssh to login, the ssh client typically uses keyboard-interactive authentication, rather than password authentication. Even PuTTY correctly falls back to keyboard-interactive authentication mode when you disallow password authentication. I don't know of any instance where password authentication is actually used, except by bots to perform brute force attack.

I discovered this because I noticed the difference of the log entry written to /var/log/secure.log when someone tries the brute force attack, and when I typed in the password incorrectly. If I tried the wrong password, the log looks like this:

Apr  6 20:00:51 kawazu sshd[6163]: Failed keyboard-interactive/pam for invalid user oracle from port 54021 ssh2

But when a bot does it, the log looks like this:

Apr  6 18:14:11 kawazu sshd[5172]: Failed password for invalid user oracle from port 54865 ssh2

This is an indication that a scriptable mechanism, like Paramiko, is used to conduct a brute-force attack.

That said, after configuring sshd_config the way I mentioned above, if you need to allow Paramiko to connect to your server, you will have to use publickey authentication.

Thursday, April 1, 2010

xkcd Tribute

Save the following as a Makefile in some directory, and have fun.
.PHONY: me a sandwich

        @if [ $(shell id -u) -eq 0 ]; then \
          echo "Okay."; \
        else \
          echo "What? Make it yourself."; \
Happy March 32nd.

C++ default or copy constructor

If you implement a lot of data structure in C++, at one point you'd end up writing code that looks like this:
template<typename Tp>
class node {
  node(const Tp& value) : value_(value), next_(NULL) {}

  Tp value_;
  node<Tp> *next_;
The problem with this construct is that we rely on the existence of copy constructor for Tp. The author of the node class probably wanted something like this:
template<typename Tp>
class node : public Tp {
  node(...) : Tp(...), next_(NULL) {}

  node<Tp> *next_;
Unfortunately, C++ does not allow a subclass to inherit constructors, nor does it let us pass a variable number of arguments to the parent class constructor. A compromise is, we mandate that class Tp has a default constructor, like this:
template<typename Tp>
class node : public Tp {
  node() : Tp(), next_(NULL) {}

  node<Tp> *next_;
However, most templates in STL would define the node this way:
template<typename Tp>
class node {
  node(const Tp& value = Tp()) : value_(value), next_(NULL) {}

  Tp value_;
  node<Tp> *next_;
The benefit is that if Tp has a copy constructor, one would be used to construct private member value_; and if Tp has a default constructor, that would be used. This form relaxes the requirement that Tp would only need to provide either a default or a copy constructor.