Friday, October 8, 2010

A test on gcc's ability to optimize

I have an implementation of some singly linked data structure which I wish to adapt for owner reference. An owner reference is like a "parent" pointer in the case of a binary search tree, or the "previous" pointer in the case of a doubly linked list, i.e. a back pointer. However, the owner reference does not point to a node.

To understand what the owner reference is, I've written about tracking object ownership before, where we use a linear pointer to keep track of who is the owner of that object. Each object would have exactly one pointer that points to the object, and whoever has that pointer is the owner. The owner is responsible for disposing the object.

The principle problem with linear pointer is that it cannot be used to implement doubly linked data structure by definition, since doubly linked data structure implies that a node could have two incoming pointers. In a situation like that, only one pointer can be the owner. The other pointer must be a loan.

You can implement a doubly linked data structure with a linear pointer and a loan pointer, but we have to realize that the chief reason to have doubly linked data structure is so we can easily remove a node, knowing only its address, in O(1) time without needing to look up where it is in the data structure.

The owner reference in a node is a device to tell the address of the linear pointer that is the node's owner. With owner reference, it is possible to borrow an object and then steal ownership of it from the previous owner.

Using owner reference is preferred in case of binary tree node that requires in-place removal. If we use a parent node pointer, we would need additional logic to figure out whether the node we want to remove is the left or the right child of its parent. With owner reference, we don't care; the parent could even be the root pointer. The owner reference leads us right to the correct pointer that we need to modify.

However, the owner reference is additional book-keeping that needs to be updated whenever we use another routine to change the structure, such as list reversal. In the case of binary search tree, I would need to quickly augment a splay tree I implemented before with owner references. Obviously, owner reference changes whenever a linear pointer move occurs, and in C++ I could fortify the assignment operator of my linear pointer to take care of that for me. This would allow the same code that previously dealt only with singly linked data structure to become owner reference aware. However, the code could be doing extra work.

To illustrate, suppose we have a move function that supports owner reference update when doing assignment.
class node {
  node *next() const throw() { return next_; }
  node *& next() throw() { return next_; }

  node **owner() const throw() { return owner_; }
  node **& owner() throw() { return owner_; }

  node *next_;
  node **owner_;

void move(node *& to, node *& from) {
  to = from;
  to->owner() = &to;
  from = 0;
The move assignment updates the pointers linearly and keeps the owner reference updated. Once you move ownership from one pointer to another, the original pointer would lose the pointer to the node, and that's why it's filled with a NULL pointer, or 0. The owner would be the new "to" pointer.

Now consider the swap function.
void swap_naive(node *& a, node *& b) {
  node *tmp;
  move(tmp, a);
  move(a, b);
  move(b, tmp);
Notice that we can achieve the same effect with much less work.
void swap_fast(node *& a, node *& b) {
  node *tmp = a;
  a = b;
  b = tmp;
  a->owner() = &a;
  b->owner() = &b;
The question I ask in the subject of this post is, if we write swap_naive, is GCC smart enough to emit code in the fashion of swap_fast? If it can, then I could merrily overload the assignment operator of an augmented linear_ptr<> class to update owner references. However, this means I will not be able to write swap_fast() directly. If not, that means I'll need to manually optimize away the redundant owner reference updates.

I compiled with g++ -S -O3 -fomit-frame-pointer with different gcc versions, and here is the result for the number of essential instructions (not counting boilerplate context saving and return) for either function.
Compiler (output)swap_naive()swap_fast()
gcc-4.4.1 -march=i68611 (9)9 (8)
gcc-4.4.1 x86_6477 (6)
apple gcc-4.2 -arch i6861510
apple gcc-4.2 -arch x86_64118
apple gcc-4.2 -arch ppc128
apple gcc-4.2 -arch ppc64128
The reason I decided to do x86_64 and ppc/ppc64 is because these architectures have more register count than i386, and therefore is able to retain more data in the register rather than needing to spill data to the stack frame.

It is definitely surprising how gcc-4.4.1 on x64_64 can achieve the same number of instructions for both swap_naive() and swap_fast(), although you can see from the assembly listing that the naive version does assigns zero to the next_ pointer. And closer inspection in the assembly of swap_fast() indicates that the movq (%rdi), %rdx instruction is redundant because we had movq %rdx, (%rdi) before. The compiler could have saved us one more instruction, but the optimization has a bug.

Regarding optimization bug, a closer inspection on gcc-4.4.1 on i686 assembly code reveals that the subl  $16, %esp and addl $16, %esp instructions are redundant. This means, with proper optimization, we could bring instruction count for swap_naive() down to 9 as well. Again, in swap_fast(), the movl (%eax), %ebx instruction is redundant, which brings the instruction count down to 8.

The compiler optimizer is indeed impressive, but not without flaws.

After this study, I determined that it is better to not blanket overload the assignment operator. I would modify singly linked routines to use move(), which is identical to simple assignment in singly linked data structures. However, for augmented linked structure, it will be possible to manually optimize some of these routines further by not using move(). For example, I could leverage a singly linked list sort() to sort an augmented list with owner reference, but I only relink the owner reference after sorting is done. This would not be possible had assignment operator been overloaded.

No comments: