Sunday, November 14, 2010

Personal notes on selected ISMM 2010 papers

For Optimizations in a private nursery-based garbage collector,
  • The existing parallel generational garbage collector uses per-thread nurseries, but after a minor collection a per-thread nursery may be handed to another thread, causing massive false sharing. This was revealed by VTune.
  • This behavior is especially problematic because nurseries cycle quickly when the program allocates many objects that quickly become garbage.
For Efficient memory shadowing for 64-bit architectures,
  • Shadow memory stores information about application addresses; it is typically used by a memory analysis tool such as Valgrind to answer questions like: is this address allocated? is it owned by a particular thread? is it locked?
  • Direct mapping stores the information in the application's own address space, at address * scale + displacement. It only works if the address space doesn't have holes.
  • Segmented mapping is a special case of direct mapping. The shadow memory is only big enough to store an indirect pointer through which the information is retrieved. For 64-bit addresses, it may use several levels of indirection tables, much like how a page table is organized. The indirection makes segmented mapping several times slower than direct mapping.
  • The paper proposes an approach that allows several possible displacements (not too many). The displacement is chosen to guarantee no false hits; address ranges that could cause false hits are reserved.
  • This technique could conceivably be used by memory allocators as well, since an increasing number of allocators have built-in memory analysis.
For Memory, an elusive abstraction,
  • On a multi-processor system, the ordering of memory reads and writes, as a side-effect of processor design, may become observable to a program running on another processor. A cache coherence protocol can be designed to hide that artifact at the cost of a slow-down, but Lamport observed that some applications can trade consistency for performance.
  • Relaxed consistency needs precise documentation in order to reason about correctness of concurrent programs. However, processor vendors want to keep the specification vague.
  • My own observation: why not assume no consistency at all? In addition to memory barriers, the instruction set should provide a "write-through if cache line is shared" instruction (if the cache line is not shared, we obviously don't have to worry about consistency).
For The locality of concurrent write barriers,
  • Although I prefer applications that forego garbage collection for performance and predictability, this paper has a nice description of how concurrent garbage collection works, and a comparison of variations.

Thursday, November 11, 2010

Continuation passing style calling convention for a C-like language

There are two strong indications that a continuation passing style calling convention is necessary for a C-like language.
  • Although user-space thread libraries exist, they still cannot truly implement lightweight processes. The problem is no longer kernel-level versus user-level context switching, since both can be equally efficient. The problem is the stack requirement of any thread. Even user-level threads must reserve a large amount of stack space (typically 2MB), making it impossible to start thousands of threads unless you use a 64-bit address space. The problem with a 64-bit address space is that it doubles the storage requirement for pointers.
  • Parallel processing frameworks such as Cilk go to great lengths trying to imitate continuation passing style, using Duff's device in generated C code so that execution can "resume" in the middle of a function. This complicates debugging as well as the runtime system's scheduler.
Inspired by LLVM calling conventions and suggestions for continuation passing style and tail-call optimization, I propose a continuation passing style (CPS) calling convention for a C-like language, where:
  • A code block may be expressed like a function that returns void *.
  • All arguments are passed in registers. Additional arguments must be spilled to a memory location, with the last register holding a pointer to the spill location. The number of argument registers is machine-dependent, but at least two register arguments are allowed.
  • Variadic arguments must be spilled to memory as a variable-length array.
  • One register is reserved for stack pointer, which must be restored to its original value when the control flow leaves the code block.
  • The stack is useful for calling a conventional C function, or for spilling block-local variables. Scope-local variables that span multiple blocks must be spilled to a heap-allocated location.
  • A code block transitions into another code block using a tail-call with a return statement.
  • It is possible to return from a continuation passing style code block. The stack pointer is restored, a void * value is prepared as the return value, and the function returns like a conventional C function (e.g. it pops the return address from the stack, or branches to the return address stored in a register, depending on the machine's native calling convention).
  • Only the stack register (and on some architectures, also the return address register) is callee-saved. The caller must save all other registers.
Here is a matrix of caller-callee interaction. There are three types of caller: CPS tail-call, CPS non tail-call, and C. There are two types of callee: CPS and C.
  • Caller is CPS tail-call, callee is CPS: arguments are passed in registers, then branches to callee.
  • Caller is CPS tail-call, callee is C: behaves like a regular tail call from a C caller.
  • Caller is C, callee is CPS: all caller registers are saved, arguments are passed in registers, and a call is made as if the CPS callee is a C callee taking no arguments (i.e. return address may be stored on the stack or in a register, depending on the machine architecture).
  • Caller is CPS non tail-call, callee is CPS: a warning may be issued, but otherwise behaves as if the caller is C and callee is CPS.
  • Caller is CPS non tail-call, callee is C; or caller is C, and callee is C: behaves like regular C function call.
LLVM requires both caller and callee to use the same calling convention. Unlike LLVM, we anticipate that C functions will want to call CPS code blocks, and vice versa. It is intentional that the CPS calling convention must be a subset of the C calling convention.

The language will be used as an intermediate language for compiling higher level languages such as ML, Cilk, etc. A typical function in the higher level source language will be broken down into more than one CPS code block.

For a source language that wants to use CPS for lightweight processes, it could be necessary for the lightweight processes to reuse the stack of the worker thread, in which case context switching requires the stack to be pristine. The lightweight process may only yield to the scheduler when the stack is empty. This condition may be checked at run time, or enforced statically at compile time by a "pristine" attribute (the stack is pristine). A pristine CPS callee may only be called by a pristine CPS tail-call caller.

Monday, October 18, 2010

C++ Overloading operator<< for std::ostream

The operator<< is used to stringify an object and "shift" the result to an output stream, such as std::cout for standard output. It is used like this:
#include <iostream>
int main() {
  std::cout << "hello world" << std::endl;
  return 0;
}
This of course only works for values of types for which operator<< is overloaded. If you want to implement serialization for a new class, you will have to implement operator<<() at global scope; you cannot do so at class scope. That is because the left hand side of << is a std::ostream& object, not your object.

For example, this is right:
class Foo {...};

std::ostream& operator<<(std::ostream& os, const Foo& foo) {
  ...
}
But this will not result in the intended notation:
class Foo {
  ...

  std::ostream& operator<<(std::ostream& os) {
    // stringify this object to output stream os.
    ...
  }
};
This means that a statement like foo << std::cout will stream foo to std::cout, but the notation is now wrong. You can overload operator>> instead, but this notation has other issues we will not discuss here.

Sometimes we do not want to overload operator<< globally, at least we don't want to pollute the global namespace with a different operator<< for each serializable class. We can do this by introducing an indirection layer with subtyping polymorphism.
class serializable {
 public:
  virtual void serialize(std::ostream& os) const = 0;
};

std::ostream& operator<< (std::ostream& os, const serializable& s) {
  s.serialize(os);
  return os;
}

class Foo : public serializable {
 public:
  virtual void serialize(std::ostream& os) const {
    os << "Foo" << std::endl;
  }
};
Then we can simply say:
int main() {
  Foo f;
  std::cout << f;
  return 0;
}
We implement exactly one global operator<< trampoline, which calls the serialize() method of a serializable abstract base class. Any subclass of a serializable that implements the serialize() method can now also be used with operator<< automatically.

If you don't want to use a virtual table, a similar technique using the Curiously Recurring Template Pattern also works.
template<class Derived>
class serializable {
 public:
  void serialize(std::ostream& os) const {
    static_cast<const Derived *>(this)->serialize(os);
  }
};

template<class Derived>
std::ostream&
operator<< (std::ostream& os, const serializable<Derived>& s) {
  s.serialize(os);
  return os;
}

class Foo : public serializable<Foo> {
 public:
  void serialize(std::ostream& os) const {
    os << "Foo" << std::endl;
  }
};

We can define a deserializable class with operator>> analogously. This is left as an exercise for the reader.

Saturday, October 16, 2010

A rather unfortunate IP packet corruption

$ dig wwwapps.upsprodcidr2.com.akadns.net.

; <<>> DiG 9.4.3-P3 <<>> wwwapps.upsprodcidr2.com.akadns.net.
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 48565
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;wwwapps.upsprodcidr2.com.akadns.net. IN        A

;; ANSWER SECTION:
wwwapps.upsprodcidr2.com.akadns.net. 457319 IN CNAME wwwap`s2.ups.com.edgekey.net.

;; Query time: 15 msec
;; SERVER: 10.0.5.1#53(10.0.5.1)
;; WHEN: Sat Oct 16 10:55:53 2010
;; MSG SIZE  rcvd: 92
Barring the ridiculous TTL of 457319, the CNAME record above apparently should read wwwapps2.ups.com.edgekey.net. The name wwwapps.upsprodcidr2.com.akadns.net is (after many indirections) a CNAME target of wwwapps.ups.com, which is used to track packages.

Friday, October 8, 2010

A test on gcc's ability to optimize

I have an implementation of some singly linked data structure which I wish to adapt for owner reference. An owner reference is like a "parent" pointer in the case of a binary search tree, or the "previous" pointer in the case of a doubly linked list, i.e. a back pointer. However, the owner reference does not point to a node.

To understand what the owner reference is, I've written about tracking object ownership before, where we use a linear pointer to keep track of who is the owner of that object. Each object would have exactly one pointer that points to the object, and whoever has that pointer is the owner. The owner is responsible for disposing the object.

The principal problem with a linear pointer is that, by definition, it cannot be used to implement a doubly linked data structure, since a doubly linked data structure implies that a node can have two incoming pointers. In a situation like that, only one pointer can be the owner. The other pointer must be a loan.

You can implement a doubly linked data structure with a linear pointer and a loan pointer, but we have to realize that the chief reason to have doubly linked data structure is so we can easily remove a node, knowing only its address, in O(1) time without needing to look up where it is in the data structure.

The owner reference in a node is a device to tell the address of the linear pointer that is the node's owner. With owner reference, it is possible to borrow an object and then steal ownership of it from the previous owner.

Using owner reference is preferred in case of binary tree node that requires in-place removal. If we use a parent node pointer, we would need additional logic to figure out whether the node we want to remove is the left or the right child of its parent. With owner reference, we don't care; the parent could even be the root pointer. The owner reference leads us right to the correct pointer that we need to modify.

However, the owner reference is additional book-keeping that needs to be updated whenever we use another routine to change the structure, such as list reversal. In the case of binary search tree, I would need to quickly augment a splay tree I implemented before with owner references. Obviously, owner reference changes whenever a linear pointer move occurs, and in C++ I could fortify the assignment operator of my linear pointer to take care of that for me. This would allow the same code that previously dealt only with singly linked data structure to become owner reference aware. However, the code could be doing extra work.

To illustrate, suppose we have a move function that supports owner reference update when doing assignment.
class node {
 public:
  node *next() const throw() { return next_; }
  node *& next() throw() { return next_; }

  node **owner() const throw() { return owner_; }
  node **& owner() throw() { return owner_; }

 private:
  node *next_;
  node **owner_;
};

void move(node *& to, node *& from) {
  to = from;
  to->owner() = &to;
  from = 0;
}
The move assignment updates the pointers linearly and keeps the owner reference updated. Once you move ownership from one pointer to another, the original pointer would lose the pointer to the node, and that's why it's filled with a NULL pointer, or 0. The owner would be the new "to" pointer.

Now consider the swap function.
void swap_naive(node *& a, node *& b) {
  node *tmp;
  move(tmp, a);
  move(a, b);
  move(b, tmp);
}
Notice that we can achieve the same effect with much less work.
void swap_fast(node *& a, node *& b) {
  node *tmp = a;
  a = b;
  b = tmp;
  a->owner() = &a;
  b->owner() = &b;
}
The question I ask in the subject of this post is, if we write swap_naive, is GCC smart enough to emit code in the fashion of swap_fast? If it can, then I could merrily overload the assignment operator of an augmented linear_ptr<> class to update owner references. However, this means I will not be able to write swap_fast() directly. If not, that means I'll need to manually optimize away the redundant owner reference updates.

I compiled opt.cc with g++ -S -O3 -fomit-frame-pointer with different gcc versions, and here is the result for the number of essential instructions (not counting boilerplate context saving and return) for either function.
Compiler (output)             swap_naive()   swap_fast()
gcc-4.4.1 -march=i686         11 (9)         9 (8)
gcc-4.4.1 x86_64              7              7 (6)
apple gcc-4.2 -arch i686      15             10
apple gcc-4.2 -arch x86_64    11             8
apple gcc-4.2 -arch ppc       12             8
apple gcc-4.2 -arch ppc64     12             8
The reason I decided to test x86_64 and ppc/ppc64 is that these architectures have more registers than i386, and are therefore able to retain more data in registers rather than spilling it to the stack frame.

It is definitely surprising how gcc-4.4.1 on x86_64 achieves the same number of instructions for both swap_naive() and swap_fast(), although you can see from the assembly listing that the naive version does assign zero to the next_ pointer. Closer inspection of the swap_fast() assembly indicates that the movq (%rdi), %rdx instruction is redundant, because we had movq %rdx, (%rdi) before. The compiler could have saved us one more instruction, but the optimizer missed it.

Speaking of missed optimizations, a closer look at the gcc-4.4.1 i686 assembly reveals that the subl $16, %esp and addl $16, %esp instructions are redundant. This means that, with proper optimization, the instruction count for swap_naive() could come down to 9 as well. Likewise, in swap_fast() the movl (%eax), %ebx instruction is redundant, which brings its instruction count down to 8.

The compiler optimizer is indeed impressive, but not without flaws.

After this study, I determined that it is better not to blanket-overload the assignment operator. I would modify singly linked routines to use move(), which is identical to simple assignment in singly linked data structures. For augmented linked structures, however, it remains possible to manually optimize some routines further by not using move(). For example, I could leverage a singly linked list sort() to sort an augmented list with owner references, and only relink the owner references after sorting is done. This would not be possible had the assignment operator been overloaded.

Friday, October 1, 2010

Starting daemon process using screen

Here is a good way to start a daemon background process using screen.
screen -d -m -L command args...
This tells screen to create a new session (-m), log its output (-L), and detach immediately (-d), after which command args... is run. The log is written to screenlog.n in the current directory, where n is the "window" number inside the screen session. The logging is buffered and flushed periodically, and the log file can be logrotated.

The screen session supports process control, i.e. stopping a running daemon. To do that, the screen session must be started with a session name (-S sessionname), and it can be killed later with the name.
screen -S pingsession -d -m -L ping localhost
sleep 20
screen -r pingsession -X quit
The advantage is that the daemon process does not have to do anything to turn itself into a daemon or to support process control. A list of screened daemons can be shown using screen -list command.

Saturday, September 18, 2010

What To Do If You Have Only One IP Address?

Back in the day when I lived with my parents and we first got dial-up, we were already sharing that single Internet connection between two computers, mine and my brother's. The modem was connected to my computer, and I set up an HTTP proxy and a SOCKS proxy using WinGate. My brother's computer was connected to mine using a crossover Ethernet cable. We had a small LAN thanks to our involvement in LAN parties back in high school.

When we got DSL, I retired WinGate and used the NAT masquerading capability built into the Linux 2.4 kernel to share the Internet connection. Although NAT is much more transparent (no cumbersome proxy configuration), we could only have one computer play online at a time, because game servers would get confused if they saw two clients coming from the same IP address. These game servers are definitely more lenient nowadays, thanks to the proliferation of DSL sharing.

At that time, I also started playing around with running various kinds of servers: web servers and game servers. I learned that you have to set up port forwarding at the NAT gateway if you want the server to be reachable from the Internet. Port forwarding forwards all connections to a particular port on the Internet side of the gateway to one computer on the LAN. It cannot forward one port to two computers. If I had two computers running the same type of server, both wanting the same port forwarded to them, I would be in trouble.

What if I do want to run web server for two or more websites? For web servers I have several options:
  • Run two servers and forward two ports, port 80 to one computer, and 81 to another. The URL would have to specify the port number explicitly, e.g. http://www.example.com:81. This is admittedly ugly.
  • Run just one server, and use virtual hosting. This is a feature of HTTP/1.1.
  • I could forward port 80 to a reverse HTTP proxy, and let the reverse proxy decide how to handle the traffic. Some may go to one server, and some goes to the other.
However, options are very limited for other kinds of servers, especially game servers. They often insist on using a fixed, non-configurable port number as well. The situation got a bit hairy during bouts of sibling rivalry, when both my brother and I wanted to run our own game server and neither wanted to join the other one's. Of course, me being the network admin, I got the unfair upper hand.

The lesson is that if you want to run multiple instances of certain servers, you will need multiple IP addresses. The problem is that IP addresses are in short supply and running out. Also, residential Internet users only get one IP address. Only business users get multiple static IP assignments, which cost extra money.

Fast forward ten years. Game servers are still as stubborn as ever, requiring the use of a fixed port number. However, other services are becoming more flexible, allowing the use of any port numbers (it used to be the case that port number uniquely identifies a service). Some examples are XMPP and SIP. These services use the SRV record configured in a DNS server to locate the port and IP address. It would be nice if HTTP, probably the most popular server type on the Internet, could become SRV-aware one day.

Meanwhile, more peer to peer technologies such as Skype, for both voice and video, have been developed. Peer to peer sometimes requires both parties to be directly reachable on the Internet, and certainly requires at least one party to be reachable. However, recall that computers sharing an Internet connection behind NAT need special setup to be reachable; the proliferation of shared home DSL is a big headache for these peer to peer technologies.

In order for peer to peer technologies to work when both parties are behind NAT, more technologies like UPnP and STUN have been developed. UPnP allows a program running on a LAN computer to find the gateway (which needs to be UPnP aware, certainly not the case for my home-brew Linux NAT box 10 years ago) and add dynamic port forwarding; it works for TCP connections. STUN uses clever packet header forging to puncture holes for UDP connections. Otherwise, a third party relay that is reachable by both peers is required.

The third party relay certainly makes the "peer to peer" part of the technology moot. For the service operator, it means additional network and computing resources to run the relays. For end users, it means being asked to trust a third party not to misuse the content of a communication that was supposed to involve only two parties in the first place. The only real solution is for both peers to have their own IP addresses. Again, this is not possible.

Until now.

More and more ISPs are deploying IPv6, which solves the address shortage issue. Even if an ISP does not provide IPv6, you could often use a free tunnel service that gives you IPv6 connectivity tunneled over the old IP, which is now called IPv4. Airport Express has an advanced option to configure IPv6 tunneling. If your ISP provides a tunnel, Airport Express can pick it up automatically using RFC 3068. If not, it will use a tunnel broker provided by Sprint. The end result is that Airport Express gives each device on the LAN a publicly addressable IPv6 address using the IPv4 address concatenated with the MAC address of the device. One perk of using IPv6 is that you'll see the Kame turtle dancing.

Nowadays, I use Comcast business class service with 1 static IP address (courtesy of my employer, who pays for it). Using Airport Express, I'm happy with the fact that each computer on my LAN now has a publicly addressable static IPv6 address. I look forward to more IPv6-only peer to peer technologies being developed, technologies that are free from ugly workarounds such as UPnP and STUN.

Thursday, August 26, 2010

Killing the eBay sniper

When it comes to online auctions, a sniper is a bidder who makes his bid at the last second, with the intention of outbidding the current highest bidder without giving him a chance to bid again. Here is an example of a sniper in action. Notice that the sniper (r***a), contrary to what the name may imply, is not actually very discreet. He started showing interest about an hour before the auction ended, but could not become the highest bidder until the last second.

We can see that f***f, the person who got outbid by the sniper, only bid once. He's a traditional eBay bidder who takes advantage of so-called proxy bidding. Proxy bidding lets the winning bidder pay the amount bid by the second highest bidder plus a small increment, the least amount the winning bidder would have to bid in order to outbid the second person. While proxy bidding saved f***f tons of trouble, when he got sniped by r***a in the last second, he didn't have a chance to decide whether he wanted to bid higher. One could argue that if f***f really wanted to pay more, he should have raised his first bid. However, a person also makes a bid to gauge the interest level of fellow bidders. A sniper, by virtue of his operation, conceals his interest until the last second to keep the apparent interest artificially low.

There is one thing eBay could do to discourage sniping: every time a bid is made, whether or not it outbids the current highest bidder, extend the auction end time by an hour. The sniper r***a in this case would have extended the end time by 20 hours. Another, orthogonal policy adjustment is to limit the number of bids each bidder can make to a low number, say 5. This encourages more bidders to take advantage of proxy bidding, and also limits how long a bidder can extend the auction end time by bidding.

eBay, I hope you're listening.

Friday, August 13, 2010

The Equal Sign

I've long wanted to write about the equal sign but I've been putting it off. Now I have an excuse to write about it. It is reported that students don't understand what the equal sign means. I don't blame them because the symbol '=' is overloaded. It is given four meanings in grade-school curriculum.

The first meaning is reduction of a computation, as in arithmetic. An example is 2 × 5 + 4 × 3 = 10 + 12 = 22. In this case, the reduction from the formula 2 × 5 + 4 × 3 to a value requires us to compute 2 × 5 = 10 and 4 × 3 = 12 first because of operator precedence, and then compute 10 + 12 = 22. When arithmetic is taught, the equal sign is taken to mean reduction. We kind of gloss over the fact that there is a specific order of reduction; the order of reduction is never formally taught.

The second meaning of the equal sign, by the time you learn algebra, is constraint solving. You are asked to solve for the values of the variables given a system of constraints, for example, 2x - y = 7 and x + y = 5. The answer is x = 4 and y = 1. Here the equal sign represents a goal, and you are asked to give a solution for the goal.

The third meaning of the equal sign, if you haven't flunked out of math class yet, is definition. You learn in pre-calculus how to define a function, e.g. f(x) = x² - 3x + 7. The formula x² - 3x + 7 has been given a name, f(x), with some implicit notion of variable scoping and substitution. For example, to compute f(4) - 5, you would first reduce f(4) to its definition x² - 3x + 7, substituting 4 for x, which gives you 4² - 3 × 4 + 7 = 11. Then you compute 11 - 5, which gives you 6. The equal sign here allows you to give a long formula a short name.

Finally, we encounter the fourth meaning, in a very subtle way, whenever we talk about constraint solving and function definition: we also use the equal sign to denote a substitution. For example, you can verify that the system of constraints 2x - y = 7 and x + y = 5 yields the solution x = 4 and y = 1 if you substitute 4 for x and 1 for y, so that 2 × 4 - 1 = 8 - 1 = 7, and 4 + 1 = 5. And to compute f(4), you compute x² - 3x + 7 in the context of x = 4 because of the function definition.

All four notions of the equal sign are different, but they all use the same symbol. That's why students get confused. And it takes a computer scientist to tell a mathematician about that!

Wednesday, July 21, 2010

Rate difference means jitter

Movies on film are shot at 24 frames per second (fps), and television (in the US) is broadcast at approximately 60 fields per second, or 60i, where the i suffix means "interlaced": one frame shows only the even horizontal scan lines, and the next shows the odd scan lines. The antonym of "interlaced" is "progressive," which means the whole picture is shown in every frame. Movies on film are progressive. The process of adapting movies for television is known as telecine. Film and TV engineers have always known that telecine can introduce jitter. The jitter is not so much due to the difference between progressive and interlaced, but due to the difference in frame rate, since 24 does not divide 60.

[Figure: frames from a 24 fps source mapped onto 60 fps display slots; one source frame spans 3 slots while the next spans only 2.]
Frames are supposed to be shown at a constant rate, which means a fixed interval. However, when you adapt one frame rate to another one, a frame would be shown for a longer duration than the next one. In the figure above, the green frame from the 24 fps source spans 3 frames in 60 fps, but the blue frame only spans 2 frames.

This difference causes the motion to appear jerky.

Of course, this problem arises whenever you perform constant-rate sampling and wish to adapt the samples to another rate. For a waveform stream such as audio, an interpolation scheme tries to "guess" the curve of the waveform and resamples the guessed curve. This introduces interpolation error if our guess is off.

How does this relate to computer science? When you have two fixed-rate link layer networks and try to connect them, the rate difference causes jitter.

Thursday, July 1, 2010

C++ enum default constructor

Every C++ enumeration type has a default constructor. Suppose we have the following enum specifier:
enum thing {
  foo = 13, bar = 31,
};
what would the following program print? The answer may be surprising.
#include <iostream>

int main() {
  thing x = thing();  // invokes enum's default constructor.
  std::cout << "x = " << x << std::endl;
  return 0;
}
It turns out that the code above prints x = 0, which is the integer value of neither foo nor bar. The reason is that an enum is considered plain old data (ISO/IEC 14882:2003 section 3.9 clause 10), and all POD types have a default initializer that initializes the value to zero (ISO/IEC 14882:2003 section 8.5 clause 5).

When dealing with an enum type value, we must always consider the possibility of 0 even if it was not one of our declared choices.

Monday, June 21, 2010

C++ STL std::vector capacity() Not A Substitute For size()

When you use the std::vector<> template in C++, one thing you notice is that the actual capacity(), the number of elements allocated for the internal array, may be different than size(), which is the number of items currently stored by the user. In general, size() <= capacity(). The capacity() can change when reserve() is called, and reserve() can be called by insert() or resize().

When accessing the vector using operator[], the index is not checked against valid bounds. The at() accessor defines valid bounds to be 0 <= i < size(), where i is the argument to at(), and the domain applies to operator[] as well. If you don't check for bounds explicitly, it should at least be an invariant of your code.

It is worth noting that when size() <= i < capacity(), although your program technically does not access illegal memory (a memory checker like Valgrind would not be able to detect it), the elements at those indices are strictly undefined, for the following reasons.
  • When the next resize() expands the size(), it will attempt to overwrite the elements in the internal array from the previous size() up to the new size().
  • insert(end(), ...) will overwrite the elements in the internal array after size().
  • Even if you avoid using resize() and never use insert(), relying solely on reserve() and operator[], there could still be a problem when reserve() causes reallocation of the internal array: only up to size() elements are copied from the original array to the new array.
In other words, the elements in the internal array from size() through capacity() are ignored by the vector and assumed to be non-existent. A symptom is that element values previously stored through operator[] get erased to the default, or to some other value specified for resize(). It is a bad idea to use capacity() for bounds checking. The proper bounds are defined by size().

Thursday, June 17, 2010

Lack of Lexical Context Closure an Impediment to Abstraction

Lexical context (also called the environment) describes what identifiers need to be present as well as their types in order for a fragment of a program to run. Closure allows a first-class function to refer to its lexical context while being called in the lexical context elsewhere. This is useful for writing algorithms as a higher-order function whose behavior can be customized by a function argument. This is widely understood by functional programming community (LISP/Scheme, ML, Haskell) to be a very powerful abstraction tool.

It is only starting to be noticed by the C++ community. In C++, the std::for_each algorithm can be used to apply a function to elements in a container between two iterators. But only in C++0x can you use it without too much notational overhead.
// From Wikipedia: C++0x
std::vector<int> someList;
int total = 0;
std::for_each(someList.begin(), someList.end(), 
  [&total](int x) { total += x; }
);
std::cout << total;
The idea is that std::for_each would describe a specific kind of for-loop, and its usage makes the intent of the program more obvious. Note that the lambda argument would be called inside std::for_each, which is in a different lexical scope than where the lambda is defined.

When compiling to machine code, a typical way to implement closure is to enclose the environment in a struct, let each function take a pointer to the environment as an implicit argument, and make the function pointer a pair of pointer to the code as well as pointer to the argument. In plain C, it would look like this:
typedef struct env {
  int total;
} env_t;

void anonymous_func(void *ctx, int x) {
  env_t *envp = (env_t *) ctx;
  envp->total += x;
}

typedef struct for_each_int_arg {
  void (*code)(void *, int);
  void *ctx;
} for_each_int_arg_t;

void for_each_int(int *first, int *last, for_each_int_arg_t f) {
  for ( ; first != last; first++)
    f.code(f.ctx, *first);
}

void sum(int *numbers, size_t n) {
  env_t env = { .total = 0 };
  for_each_int_arg_t func = {
    .code = anonymous_func,
    .ctx = (void *) &env,
  };

  for_each_int(numbers, numbers + n, func);
}
In fact, this technique is prevalent in C programs that take a callback in the form of a function pointer plus a context pointer. A profound implication for callbacks that do not take a context pointer is that the function argument to these higher-order functions can only refer to the global or static variable scope as its lexical context. Unless these functions do not affect external state (i.e. they are pure and effect-free), this limitation makes it hard to encapsulate the state of the program into small reusable discrete units, and is therefore a violation of abstraction.

Let's take a look at several C library calls and see what the implication means.
  • qsort(), bsearch(): the compar argument does not take a context. That may be fine because compar is supposed to be pure, i.e. causing no side effects. However, neither qsort() nor bsearch() is very generic either; they assume the data is laid out entirely in a C array. A more general version may require the compar function to take a context, for example when operating on the content of a huge file without reading it entirely into memory.
  • atexit(): the function argument does not take a context. If the exit routine is supposed to synchronize files, databases or remote processes, all the state would have to be global.
  • signal() and sigaction(): the signal handler does not take a context, so signal handling cannot be modularized. There are legitimate needs for modular signals, such as alarm and broken-pipe detection; these need alternative workarounds.
  • pthread_key_create(): the destructor of a thread-specific key takes the thread-specific value as the context.
  • pthread_cleanup_push(): the cleanup routine takes a context.
  • pthread_create(): the start routine takes a context.
We can see that the library calls whose callbacks don't take a context pointer are restricted in their utility.

In C++, the implicit this object pointer together with a type parameter can be used to compensate for the lack of a context pointer, using a function object. The type parameter specifies a class that implements operator(), the function call operator, which is given an implicit this pointer when called. This is how std::for_each worked before C++ supported lexical context closure. You would have to rewrite the one-line anonymous function in C++0x as several lines of function object, incurring notational overhead that is hard to justify for a simple for-loop construct.

Even so, there are cases where the lack of lexical context closure in C++ can impede abstraction and code reuse.

A while ago, I implemented a Splay tree where the pointer as well as the nodes are abstracted out.
template<class Ptr /* implements pointer<? implements binary_tree_node> */>
struct splay_tree /* implements binary_search_tree */ {
  template<class Compare /* implements unary_compare<Ptr> */>
  static ptr_type&
  find(Compare& cmp, ptr_type& root) throw() {
    ...
  }

  // More functions that operate on splay trees...
};
The idea is that whoever wants to use the Splay tree algorithm will write a node class that implements the binary_tree_node interface (provides accessor and mutator to left and right subtree pointers). The template argument Ptr is the pointer type to the node. The find() function takes a comparable object, which can either be a key or another node, which will be used to find a node in the tree that compares equal to the comparable object. The template argument Compare is the type of the comparable object.

Generalizing the pointer allows an alternative memory model to be used. In my case, I used the splay tree with linear_ptr<>, which performs ownership tracking, so I know the algorithm preserves resource ownership. It has been used successfully this way. Another use of an alternative memory model is a pointer that is a subscript into a particular array. This allows binary tree nodes allocated from a small array (say < 256 entries) to use just one byte for the pointer.

We could try it like this (though, as we will see, it doesn't work):
template<typename Index, typename Tp, Tp *array>
class small_pointer {
 public:
  small_pointer(Index i = 0) throw() : i_(i) {}
  Tp& operator*() const throw() { return array[i_]; }
  Tp* operator->() const throw() { return &array[i_]; }
 private:
  Index i_;
};
A seasoned C++ programmer will correctly spot that array, a value of type Tp *, has to be a constant. It turns out that C++'s idea of constant means that the pointer must come from the memory location of a global or static variable. For example, you can't use small_pointer in a function like this.
void foo() {
  int numbers[10];
  for (int i = 0; i < 10; i++) {
    small_pointer<char, int, numbers> p(i);  // error!
    *p = i;
  }
}
The C++ compiler will tell you that numbers cannot appear in a constant expression. However, if you think about it, after the code for operator*() is inlined, the statement *p = i would become numbers[i] = i, which makes perfect sense. However, in the case that the C++ compiler decides to place operator*() into its own function, it would not have access to the int array, because numbers is in another lexical scope. If C++ correctly supported lexical context closure, it would be able to accept any identifier as a pointer non-type template argument. You would be able to instantiate the small_pointer template class with a function-scope identifier as well as a class-scope identifier.

Without this abstraction, I would have to re-implement the Splay tree for tree nodes that want to use a compact pointer model because the nodes are stored in an array. This need might be more common than you think, given that embedded devices like the Arduino are getting more popular these days. The difference between an O(n) and an O(lg n) operation can mean a lot to battery life, but the O(n) algorithm is often chosen due to the prohibitive complexity involved in getting an O(lg n) data structure working in a compact memory layout. I certainly do not look forward to re-implementing the Splay tree for each possible compact node layout. This is something that the compiler should do for me.

Tuesday, June 15, 2010

Hazard in User Interface Design

Improper user interface design could lead to hazard, but hazardous user interface design is not limited to computers; it can happen in real life. In this case, we're talking about a traffic intersection.
Here is a photograph of the intersection. It looks innocuous enough.
However, upon closer look, there are two things to notice.
First, St. Marys St. is two-way up until the overpass that crosses Mass Turnpike, as indicated by the “DO NOT ENTER” and “ONE WAY” signs at the left corner. The traffic light hanging in the middle of the intersection indicates that it's permitted to make a left or right turn. However, at the right corner, we see a non-directional green light.
The green light invites the intuition that it is okay to go straight, which is incorrect and inconsistent with the sign on the other side. A naive, optimistic driver may see the illuminated green light first because it is on the same side of the road and assume that it is okay to go straight. After all, through traffic is typical for most intersections. However, this particular intersection is unusual where three of the four ways of the intersection are bi-directional, but one of the four ways is unidirectional. The green light makes the atypical seem typical, and it encourages a driver to ignore other signs of unusual road condition.

One may argue that a good driver will take all road signs into account. However, a driver under stress, for example, one who is honked by a car behind, or a driver who is distracted by cellphones, could very well pick the wrong visual cue. The inconsistent traffic light is a hazard, and a flaw in the user interface design.

A proper redesign would replace the non-directional green light with left and right arrows, like the one in the middle of the intersection. It could also keep the red light lit at all times just to be abundantly clear.

Thursday, June 10, 2010

Finding n Consecutive Bits of Ones in a Bit Vector in O(lg n) Time

We want to find out whether a bit vector X of length m has n consecutive bits of ones, as well as all the positions where we can find it.

For example, the following bit vector:
(msb) 11111111 01111111 00111111 00011111 (lsb)
has 5 consecutive ones starting at bit 0, 6 consecutive ones at bit 8, 7 consecutive ones at bit 16, and 8 consecutive ones at bit 24. In addition, we count 4 consecutive ones at bit 1, 3 consecutive ones at bit 2, and so on. Overlaps are also considered. Therefore, if we were to find all positions with 6 consecutive ones, we could generate a positional bit mask like this:
(msb) 00000111 00000011 00000001 00000000 (lsb)
The result indicates that we can find 6 consecutive ones in the original bit vector if we start at bits 8, 16, 17, 24, 25, 26.

Of course, if we can only access one bit at a time, the best running time we can do is O(mn). Most computers provide bitwise operators to compute logical AND and OR on the matching bits of two bit vectors simultaneously, as well as bit shift operators. This, as we show, will dramatically reduce our running time.

Our first attempt is to simplify the problem of finding two consecutive ones, that is, bit i should be set if both bit i and bit i + 1 are ones. This is easy. We shift X to the right by one bit, and AND it against the original unshifted X. In C, this is written as X & (X >> 1). We can generalize the first attempt to find n consecutive ones by doing X & (X >> 1) & (X >> 2) & ... & (X >> (n - 1)). This naive algorithm takes O(n) time.

The part that gets really interesting is when we observe that 4 consecutive ones can be detected by 2 consecutive ones followed by another 2 consecutive ones. In the original example,
(msb) 11111111 01111111 00111111 00011111 (lsb) [n = 1] Y0 = X
(msb) 01111111 00111111 00011111 00001111 (lsb) [n = 2] Y1 = Y0 & (Y0 >> 1)
(msb) 00011111 00001111 00000111 00000011 (lsb) [n = 4] Y2 = Y1 & (Y1 >> 2)
(msb) 00000001 00000000 00000000 00000000 (lsb) [n = 8] Y3 = Y2 & (Y2 >> 4)
The reason why this works is that in computing Y2, we use the fact that Y1 has gaps of 2 consecutive zeroes to shorten our span of ones, and in Y3 uses the fact that Y2 has gaps of 4 consecutive zeroes, and so on. This is another type of SWAR algorithm.

We keep doing this until we reach the largest power of 2 that is <= n. Then, in the last step, we apply one more shift to account for the remaining bits.

In the original example, suppose we want to find n = 7 consecutive ones.
(msb) 11111111 01111111 00111111 00011111 (lsb) [n = 1] Y0 = X
(msb) 01111111 00111111 00011111 00001111 (lsb) [n = 2] Y1 = Y0 & (Y0 >> 1)
(msb) 00011111 00001111 00000111 00000011 (lsb) [n = 4] Y2 = Y1 & (Y1 >> 2)
(msb) 00000011 00000001 00000000 00000000 (lsb) [n = 7] R = Y2 & (Y2 >> 3)
The last step R, instead of shifting by 4, we only need to shift by 3.

The final algorithm is this:
typedef unsigned int bitvec_t;

bitvec_t consecutive_mask(bitvec_t x, unsigned int n) {
  unsigned int stop = prev_pow2(n);
  for (unsigned int i = 1; i < stop; i <<= 1)
    x &= (x >> i);
  if (stop < n)
    x &= (x >> (n - stop));
  return x;
}
The for-loop runs in O(lg n) time. We still need to show that prev_pow2(), which computes the largest power of two that is <= n, also runs efficiently. Again, this can be done using a SWAR algorithm.
unsigned int prev_pow2(unsigned int n) {
  return (fold_bits(n) >> 1) + 1;
}
where fold_bits() is the SWAR algorithm defined as follows:
unsigned int fold_bits(unsigned int x) {
  x |= x >> 1;
  x |= x >> 2;   /* last for 4-bit */
  x |= x >> 4;   /* last for 8-bit */
  x |= x >> 8;   /* last for 16-bit */
  x |= x >> 16;  /* last for 32-bit */
  /* x |= x >> 32; -- last for 64-bit; shifting a 32-bit unsigned int
     by 32 is undefined, so enable this line only for a wider type */
  return x;
}
The remark "last for k-bit" indicates that if x only has k bits, then that line is the last operation of the fold_bits algorithm. Since our n only needs 5 or 6 bits, we only need to go up to "last for 8-bit." It turns out that this part is also O(lg n).

Here we presented the problem of finding the positions of all consecutive bits of ones in a bit vector, and an algorithm to do it in O(lg n) + O(lg n) = O(lg n) time. An application for this algorithm is dynamic storage allocation. Suppose I have each bit in the bit vector to indicate the availability of a storage cell, then this algorithm allows me to find places where I can reserve n consecutive storage cells in O(lg n) time.

Tuesday, June 8, 2010

pthread atexit()

The atexit() function registers a function to be called when a process terminates, via exit() or via return from main(). However, when a pthread terminates, pthread_exit() does NOT call any of the atexit() functions. There is a similar pthread_cleanup_push(), which does get called, but pthread_cleanup_push() and pthread_cleanup_pop() are expected to be a matching pair in stack order, so it only works if pthread_exit() sits between a pair of push() and pop().

Fortunately, when a pthread exits, it calls the destructor function specified by pthread_key_create(). Although all threads need to use the same destructor, we can write a destructor that calls another function whose pointer is stored in the thread-specific data, facilitating a thread-local destructor.
typedef void (*exit_t)(void);

void thread_local_destructor(void *arg) {
  exit_t destructor = (exit_t) arg;
  destructor();
}
It should be a doable exercise to extend this to register an arbitrary number of functions to run at pthread exit.

Bridging the high level concept with low level detail

I was fascinated by this blog post written by an undergrad, relaying with excitement his teaching fellow's announcement that "binary search tree and merge sort are broken." From the details that he cited (posted at the Google Research blog, written by Joshua Bloch, the author of Effective Java), I think he meant binary search, not binary search tree. The problem has to do with overflow in this line:
int mid = (low + high) / 2;
The assumption is that low, mid and high are all indices into some array, with low <= mid <= high. When you add low and high, the result could overflow and become negative, making mid negative. This causes an array index out of bounds. In Java, the code will crash right away due to the run-time array bounds check. In C/C++, it might access a bad memory location and cause undefined behavior, including non-termination. The problem becomes a different one if you change int to unsigned int, in which case mid will be smaller than both low and high. Although mid will fall inside array bounds, it's still a violation of the loop invariant, which could make the code non-terminating in C (Java doesn't have an unsigned int type). The fix that Joshua suggested was to write:
int mid = low + (high - low) / 2;
Another fix, if you know that you're using a binary computer (which is most computers nowadays), is this aggregate magic hack to average two integers using bitwise operators.

I'd like to point out that the algorithm is not broken. An algorithm is a high-level concept of how to do the work. The algorithm is correct, and it's provably correct. What is broken is the program, whose capability is restricted by the finite machine it runs on. When you translate from high level to low level, the loss in computing power (from infinite state to finite state) causes program error. The problem with the kind of computer science education I'm familiar with is that we often teach computer science in terms of high-level concepts, with the (flawed) mentality that mapping them to a low-level program will be straightforward. Only someone who actually writes the program will know that the devil is in the details.

Some programming languages are better at bridging the gap than others; one that came to mind is Scheme, which uses arbitrary precision integer, as well as built-in rational number. It is a very useful language for describing algorithms because many of the machine-specific aspects are hidden by the language and its run-time system. However, programs written in Scheme will run slower because of the arbitrary precision integer arithmetic. ML programs using the IntInf package also share the same benefit and shortcoming.

What I think would be very useful is to write the algorithm as a template program that can be instantiated with different integer sizes, but let the compiler deduce its limits. For example, the binary search bug could be avoided if overflow does not happen. The compiler will generate a precondition requiring that the caller of that function must check that low + high <= INT_MAX. If int is 32-bit, then INT_MAX is 2147483647; if int is 16-bit, then INT_MAX is 32767. Furthermore, since 0 <= low <= high < array_length, it suffices to show that if array_length <= INT_MAX / 2, then the function will be safe. The last step is perhaps not deducible by the compiler, but a programmer can use that fact to convince the compiler that the limits are satisfied.

You can almost write such code in the programming language ATS because the compiler statically checks for integer constraints. You will just have to use #include template technique: prepare the algorithm in a separate include file, while each instantiation will define the concrete type int and the value INT_MAX before including the algorithm. However, the language default + operator does not check for overflow, so you will have to write a new declaration of the + operator function type with the appropriate integer constraint. Maybe I'll write a tutorial in another post.

Saturday, June 5, 2010

Parallel computing and the cost of TLB shoot-down

Translation lookaside buffer (TLB) is the part of the CPU's virtual memory facility that caches the mapping of virtual memory pages to physical memory pages, which speeds up the translation of virtual addresses to physical addresses. Without the TLB, virtual memory would require several physical memory accesses for every virtual memory access, to look up the page table. The TLB makes virtual memory practical, and it is one of the earliest forms of memory caching. Virtual memory gives the illusion that the system has more available memory than is actually installed. The operating system automatically arbitrates the use of main memory, which relieves the programmer of the burden of overlay planning. It is said to dramatically increase programmer productivity. The way virtual memory works is described in detail in The Locality Principle.

In the very beginning, each process running on the operating system corresponds to one virtualized computing resource. It has exactly one CPU, and it runs in an isolated address space. On a shared-memory multi-processor system, each processor will run one process. They only need to synchronize (wait for each other) when a process causes the operating system to acquire or release physical memory; but then, only the processes that acquire or release memory needs to be synchronized. All other processes can run uninterrupted and in parallel.

When the operating system introduces multi-threading support, things get complicated. Originally, multi-threading allows concurrency in an I/O bound program, which relieves the programmer of the burden of multiplexing the control flow over I/O operations that happen concurrently. However, multi-threading also allows a CPU bound process, which has multiple threads of execution, to run across several processors, all of which share the same address space. In this case, each processor has its own virtual memory unit, in particular the TLB, that has to be synchronized whenever there is a change to the address space mapping.

The synchronization generally only needs to happen when an existing mapping is modified or when unmapping occurs. When creating a new mapping, it is guaranteed that a TLB entry for the new mapping doesn't exist, so no synchronization is needed. If the TLB is not synchronized, other threads may still have access to the old mapping, whose physical memory page might have been reused for another purpose. The synchronization uses a technique called TLB shoot-down.

When the mapping change happens, the processor performs the change, flags the shoot-down request at a shared memory location, and then triggers an Inter-Processor Interrupt (IPI) to the other processors. When those processors receive the interrupt, they flush their TLBs and acknowledge that the request has been completed. The original processor has to wait until all processors acknowledge completion.

One can imagine why TLB shoot-down is costly. The most direct consequence is that TLB has to be rebuilt, which is only cost-effective if the cost is amortized over time. Frequent TLB shoot-down will cause the virtual memory to perform poorly.

Furthermore, not only the processor that makes the change has to synchronize with the change; all the processors that share the address space are affected, regardless of whether they are interested in the change or not. CPU-bound processes that want to scale typically have their threads all dealing with a private memory area in a shared address space. This takes advantage of the CPU cache, which they use for local memory, so that bus traffic may be reduced. Bus traffic is the bottleneck of scalability in any parallel computer. However, TLB shoot-down will slow down these other threads even if they effectively use their own private memory areas. The cost of TLB shoot-down increases as the number of CPUs increases. This is studied extensively and serves as a motivation for a multi-kernel called Barrelfish.

If you're looking to write a shared-memory parallel program that is CPU-bound and want to scale to a large number of processors, you would want to reduce the number of unmap operations. Of course, the alternative is to break down the threads and revert to using several address-isolated processes that communicate over well-defined channels such as MPI. This will also work well on distributed systems.

Thursday, June 3, 2010

strxcpy—More Consistent, Safer String Copying and Concatenation

This is now published as a Google Project under the name strxcpy. The full motivation article that was previously found in this blog post has been relocated here.

The C language makes it easy to write programs with buffer overflow problems, but there is still something the library can do to help the programmer make fewer mistakes. For this reason, we propose strxcpy(), a variant of strcpy() and strcat(), which combines both string copying and concatenation.
char *strxcpy(char *dest, const char *dest_end, const char *src, size_t n);

The function takes two parameters to define the buffer, dest and dest_end. It guarantees that it will not overwrite the memory location past dest_end, and the string at dest is always NUL terminated. Furthermore, it does not require dest to be a NUL terminated string. The return value is a character pointer to the NUL terminator of the dest string, so that we can easily append another string to it by calling strxcpy() again.

We also propose the sprintf() variants using the same idiom.
char *vsxprintf(char *dest, const char *dest_end, const char *fmt, va_list ap);
char *sxprintf(char *dest, const char *dest_end, const char *fmt, ...);

See the motivation for buffer overflow problems.

Monday, May 31, 2010

Estimation of Safe Walking Route using Google Street View

Woman Follows Google Maps “Walking” Directions, Gets Hit, Sues. Despite her claim that her BlackBerry did not display the usual warning “Walking directions are in beta. Use caution – This route may be missing sidewalks or pedestrian paths,” anyone with common sense could see from Street View that there is no sidewalk, and the path might be dangerous for a pedestrian to cross.

But can Google use Street View to automatically assess the walkability of a path? Maybe. At least on busy streets, you will find pedestrians. A lot of image recognition research has gone into detecting people in an image, so if you detect people in a street view, it is likely a walkable street. Even if there are no people, it might be possible to detect common pavement painting patterns such as pedestrian crosswalks, or pedestrian walk lights. A street with those computer-detectable elements is very likely to be walk-safe.

GitFarm source code?

It is a popular request to GitFarm for its source code, but these requests are mostly ignored by its provider, Bo (yboycn). GitFarm allows you to host a Git repository on Google AppEngine. It turns out that if you know a bit of Java, you can put together something like GitFarm over a weekend. Most of the pieces you need are already available as open source.
  • JGit, Git implemented in Java (EDL licensed).
  • GaeVFS, a filesystem for Google AppEngine using datastore and memcache API (Apache licensed).
And if you build your Git hosting on AppEngine that way, you are not compelled to publish your source code. Both EDL and Apache License are only triggered by distributing source or object code, not software as a service. You are not even compelled to reproduce a copyright notice for providing software as a service.

Friday, May 28, 2010

Tracking Object Ownership in C++

Ever since people began programming in C/C++, the need to track object ownership has been recognized, but there has been a lot of confusion. We need ownership tracking in order to know who is responsible for freeing an object. This is essential for preventing memory leaks, since C++ does not typically use garbage collection (though it can use something like Boehm GC). Ownership tracking is also essential for preventing double-free, which can corrupt the memory allocator's internal data structure.

Ownership tracking does not by itself prevent all memory-related errors. For example, another part of the program could retain an aliased const reference to an object and use it after the true owner frees it. After the memory occupied by the object is returned to the allocator, it could subsequently be reused for another purpose, and that new use could be corrupted when someone modifies the aliased original object that no longer exists.

We will need the concept of linearity if we also want to prevent aliasing, but a lot of seasoned C++ programmers don't understand this. It is evident, for example, from their disdain for std::auto_ptr<>: when an STL algorithm such as sort() breaks when you use auto_ptr, they blame auto_ptr as the culprit. However, I would argue that the culprit is sort(), which creates an alias for an item in the container and abandons it without proof that it is safe to do so (perhaps because, by default, C++ treats everything as plain old data, meaning copying and assignment have no side effects). If an algorithm observes object linearity, then it will work with auto_ptr<>. There has been significant research in the programming language ATS, which strictly enforces linearity in its type system, and there is a lot of code implementing common algorithms that satisfy linearity.

While it is great if you can use a new language with built-in linearity checking, what if you're stuck with C++ (e.g. I miss template meta-programming and curiously recurring template pattern)?

The real problem I have with auto_ptr<> is that it automatically frees the object when the auto_ptr<> object is destructed. This means that non-linear code can run for a while, erroneously freeing objects until everything mysteriously disappears, and you will be puzzled by what is going on. Distinguishing move semantic and copy semantic is a start: it alleviates the need of the auto_ptr_ref hack (otherwise auto_ptr<> cannot be returned from functions due to the non-const copying constructor). But move semantics doesn't enforce linearity; items can still mysteriously disappear after they're moved out of the container but not moved back in.

The C++ type system does not have the power to enforce linearity at compile time, but the next best thing is to make programs crash early when non-linearity is detected. I've been using a home-brew linear_ptr<> class whose only difference from auto_ptr<> is that, wherever there would be a delete p, I remove it and replace it with assert(p == NULL). Regardless of move or copy semantics, a linear pointer only tracks ownership but never frees an object. I've been using this ubiquitously in an implementation of a memory allocator with great success.

In using linear_ptr<>, I came to realize that a lot of times, a function or method doesn't intend to take ownership of the object passed in as an argument, nor does it alias the object pointer, but we want to have call-by-reference argument semantics. In these cases, I use the idiom const linear_ptr<T>& which means that the pointer itself should not be modified (still points to the same object, not taking ownership) but the object itself is non-const, so it could be mutated.

It turns out that in a regular C++ program, you don't need a linear pointer to express the intention that a function does not take ownership of an object. Just use a reference! The key idea is that delete takes a pointer, not a reference. Since a reference cannot be deleted, it signifies that the caller does not pass ownership of the object to the function. Furthermore, it signifies that the reference cannot be NULL, so the object has to exist prior to the call. The caller loans the object to the function; the function borrows the object from the caller. Of course, the function should still refrain from converting the reference to a pointer or making an alias of the reference, but it is much easier for a programmer to exercise self-discipline within the local context of a function.
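For example (a hypothetical pair of functions; the names are illustrative), the signature alone tells the caller what happens to ownership:

```cpp
#include <string>

// Pointer argument: the callee may claim ownership; it is able to delete.
void consume(std::string *s) {
  delete s;                       // takes ownership away from the caller
}

// Reference argument: the callee merely borrows.  A reference cannot be
// deleted and cannot be NULL, so the object must outlive the call.
void append_suffix(std::string &s) {
  s += ".txt";                    // mutates the borrowed object in place
}
```

A caller can thus mutate through the reference-taking function and only relinquish the object when passing a pointer.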

The role of references in C++ is often under-appreciated. For example, Google C++ Style Guide imposes a rule on references that prevents you from writing a function that wishes to mutate an object but will not claim ownership.

However, it is important to note that, unlike pointers, C++ references cannot be reseated. Assignment through a reference applies to the object being referenced, not to the reference itself. This means that if you want to repeatedly borrow objects in a recursively defined data structure (e.g. a singly linked list or a binary search tree), you will have to write recursive functions rather than doing it in a loop.

In the case where you really want to write a loop that repeatedly borrows objects from a data structure, you can use a loan pointer: the object pointer is cast to a pointer to a loan class of that object. A loan class is a mixin that merely disables the constructor and destructor by making them private.
template<class superTp>
class loan : public superTp {
 private:
  // Disallow constructor and destructor.
  loan();
  ~loan();
};
It most certainly should disallow the copy constructor as well, but this is left as an exercise for the reader. When we want to signify the intention to retain ownership, we make a loan pointer using the following function:
template<class superTp>
loan<superTp> *make_loan(superTp *objp) {
  return static_cast<loan<superTp>*>(objp);
}
A similar function can be written for linear_ptr<> as well, to convert from a const linear_ptr<T>& to loan<T> *.

A loan pointer can be reseated to reference a different object, but the usual caution about aliasing applies: a function should not retain the alias beyond the scope of the function. I rarely use the loan<T> * idiom myself, because a C++ reference works most of the time when I write a function that does not intend to claim object ownership from the owner.
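To make the loop use case concrete, here is a sketch of walking a singly linked list with a reseatable loan pointer (the Node type and Sum function are hypothetical; loan<> and make_loan() are repeated from above so the example is self-contained):

```cpp
#include <cstddef>

// Repeated from above for a self-contained example.
template<class superTp>
class loan : public superTp {
 private:
  // Disallow constructor and destructor.
  loan();
  ~loan();
};

template<class superTp>
loan<superTp> *make_loan(superTp *objp) {
  return static_cast<loan<superTp>*>(objp);
}

// A hypothetical singly linked list node.
struct Node {
  int value;
  Node *next;
};

// Sums the list without claiming ownership.  Unlike a reference, the loan
// pointer can be reseated on each iteration, so no recursion is needed.
int Sum(Node *head) {
  int total = 0;
  for (loan<Node> *p = make_loan(head); p != NULL; p = make_loan(p->next))
    total += p->value;
  return total;
}
```

Because loan<Node> has a private destructor, accidentally writing delete p inside the loop fails to compile, which is the whole point of the idiom.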

In conclusion, yes, there is a difference between pointer and reference in C++, and the difference is often under-appreciated. You may use a pointer argument to signal the intent of the function to take ownership from the caller; but when you use a reference argument, the caller keeps ownership, and the function merely borrows the object for use within the function scope. This takes care of ownership disputes in function calls. Outside of a function call, however, auto_ptr<> is not enough to enforce linearity in code. Since the C++ type system does not statically enforce linearity, the next best alternative is to use linear_ptr<>, which detects non-linearity and crashes the program at run time as early as possible. These facilities can be used together to help programmers write linear code that is free of memory leaks and of memory corruption due to double-free and use-after-free.

Wednesday, May 26, 2010

Passing File Descriptor over Unix Domain Sockets

This is a follow-up to my rant in Secrets (everyone should know) About Running Web Servers. I made a few recommendations about structuring a server written in C/C++ and in Java. Namely, the server will consist of several worker processes, and a monitor process that dispatches the work. The overall design is the same for C/C++ and Java, but the motivation and rationale are different.
  • C/C++: since memory leaks are practically unavoidable in programs written in these languages, there should be a way for a worker process to restart periodically (after a certain number of requests). We need to be able to "switch on" and "switch off" any worker process in a pool of processes without exposing downtime.
  • Java: multi-threading interacts poorly with generational garbage collection, so if we don't want to resort to fudging the nursery size, each worker process should be single-threaded. However, we will run several concurrent worker processes so we can take advantage of multi-processor hardware.
Both of these require a lightweight "monitor" process that listens on network sockets, passes incoming connections to the workers, and monitors their health. We don't want to fork on accept() because some applications' startup cost can be quite high (e.g. the JVM).

It might be tempting to use FastCGI, but it turns out that the FastCGI protocol requires the worker to copy its response to the monitor, which then copies it back to the client. We want to avoid any redundant copying at all cost. This means that once the monitor accepts a connection, it should pass the socket directly to a worker and leave it to the worker to communicate with the client.

It turns out that the folks who designed BSD Unix had already foreseen this scenario. Nowadays it is possible at least on AIX and on Linux. The basic idea is to establish a Unix domain socket between the monitor and the workers. The monitor sends the open socket to a worker as ancillary "rights" data using a special flag to sendmsg(). The worker receives the socket and works with it as usual.
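A sketch of the two sides (hypothetical helper names send_fd/recv_fd; error handling kept minimal), using the SCM_RIGHTS ancillary message that the kernel interprets as descriptor passing:

```cpp
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

// Send the open file descriptor fd across the Unix domain socket sock.
// One byte of ordinary data rides along, since some systems refuse to
// deliver ancillary data with an empty payload.
int send_fd(int sock, int fd) {
  char dummy = 'F';
  struct iovec iov = { &dummy, 1 };
  char ctrl[CMSG_SPACE(sizeof(int))];
  struct msghdr msg;
  memset(&msg, 0, sizeof(msg));
  memset(ctrl, 0, sizeof(ctrl));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl;
  msg.msg_controllen = sizeof(ctrl);
  struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;   // "install these descriptors in the peer"
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
  return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

// Receive a file descriptor from sock; returns it, or -1 on failure.
// The kernel allocates a fresh descriptor in this process's fd table.
int recv_fd(int sock) {
  char dummy;
  struct iovec iov = { &dummy, 1 };
  char ctrl[CMSG_SPACE(sizeof(int))];
  struct msghdr msg;
  memset(&msg, 0, sizeof(msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl;
  msg.msg_controllen = sizeof(ctrl);
  if (recvmsg(sock, &msg, 0) <= 0)
    return -1;
  struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
  if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
      cmsg->cmsg_type != SCM_RIGHTS)
    return -1;
  int fd;
  memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
  return fd;
}
```

In the monitor-worker design, the monitor would call send_fd() with the descriptor returned by accept(), and the worker would call recv_fd() and then read and write the client connection directly.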

Since this technique is platform specific (a quick glance at the Windows Socket API function list does not reveal a similar facility for passing open sockets between processes), we can safely bet that Java does not support it in the standard Java API. Unix domain socket implementations in JNI exist; unfortunately, neither juds nor junixsocket supports passing file descriptors. This actually makes a nice business opportunity: selling a Java monitor-worker based server framework.

The same can be said about C/C++. Although I'm pretty sure someone has already rolled their own monitor-worker implementation that does what I described here, the effort involved is non-trivial, so I'm willing to bet that someone would be willing to pay for such a framework.

Monday, May 24, 2010

Mac OS X thread local storage

Historically, Mac OS X uses the Mach-O executable format, which does not support the thread local storage data section found on most operating systems that use the ELF executable format (here is how ELF handles thread local storage, as documented by Ulrich Drepper). If you try to compile a program using the __thread modifier, you get the following error.
$ cat tls.c 
static int __thread x;
int main() { return 0; }
$ gcc tls.c 
tls.c:1: error: thread-local storage not supported for this target
However, it has been suggested that pthread_getspecific() has a fast implementation on Mac OS X. The pthreads thread-specific functions are the POSIX facility for thread local storage (TLS).

On Mac OS X, pthread_getspecific() for i386 takes exactly three instructions. It is similar to the ELF handling of TLS on i386 in that both use the GS segment to point to a thread control block. The difference is that ELF stores only the thread pointer in the thread control block, which is then used to compute the address of that thread's thread local storage data section, whereas on Mac OS X the pthread-specific values are allocated directly from the thread control block in the GS segment. This saves an indirect reference, so it could be faster.

The way Mac OS X implements pthread_getspecific() for the ppc target is more complicated. It first looks up in the COMMPAGE (defined in cpu_capabilities.h) the address of one of several possible pthread_getspecific() implementations provided by the Mach kernel, and branches there. From there, the thread control block is obtained either from SPRG3 (special purpose register #3) or by an ultra-fast trap. SPRG3 points to the thread control block from which pthread specifics are allocated, so this approach is comparable to using the GS segment on i386. The ultra-fast trap, despite its name, is a slower fallback that appeals to the task scheduler to find out which thread is currently running.

There is no reason why pthread_getspecific() couldn't be implemented as efficiently as the compiler's thread local storage variables. However, with __thread, the compiler can cache the address of the thread-local variable instance and reuse it within the current function; if several functions are inlined, all of them can use the same pre-computed address. With pthread_getspecific(), the compiler doesn't know that the address of a pthread-specific variable can be reused, so it has to call the function again whenever the value is needed. For this reason, Mac OS X also provides an inlined version of pthread_getspecific() to remove the function call overhead, as well as to allow the compiler to pre-compute the offset into the thread control block whenever possible.

I don't know how other operating systems implement pthread_getspecific(), but if they already support thread local storage variables, they should be able to implement pthread specific like this to achieve the efficiency as expected on Mac OS X:
// Assume we only need a finite number of slots.
#define MAX_SLOTS 512

// For storing thread specific values, using thread local storage.
__thread static void *specifics[MAX_SLOTS];

// For storing allocation status for a pthread-specific key.
static struct {
  int created;
  void (*destructor)(void *);
} key_descriptors[MAX_SLOTS];

// For locking the routine that has to manipulate key_descriptors,
// e.g. pthread_key_create() and pthread_key_delete().
pthread_mutex_t key_descriptor_lock = PTHREAD_MUTEX_INITIALIZER;
This is inspired by pthread_tsd.c in Mac OS X Libc. Notice that, unlike their Libc, we can implement pthread_getspecific() and pthread_setspecific() using trivial inline functions or macros that access the specifics array.
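Given declarations like those above, the accessors really are trivial. A sketch (with hypothetical names my_getspecific/my_setspecific; key allocation, key validation against key_descriptors, and destructor handling are omitted):

```cpp
#include <errno.h>
#include <stddef.h>

// Assume we only need a finite number of slots, as above.
#define MAX_SLOTS 512

// Thread-specific values live directly in compiler-managed TLS.
static __thread void *specifics[MAX_SLOTS];

// Both functions reduce to an array access, so the compiler can inline
// them and reuse the pre-computed TLS address across calls.
static inline void *my_getspecific(unsigned key) {
  return key < MAX_SLOTS ? specifics[key] : NULL;
}

static inline int my_setspecific(unsigned key, void *value) {
  if (key >= MAX_SLOTS)
    return EINVAL;
  specifics[key] = value;
  return 0;
}
```

Contrast this with an opaque pthread_getspecific() call, which the compiler must re-issue every time the value is needed.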

Therefore, my recommendation is to use POSIX pthread specifics in your program, which will work on any POSIX system, including Mac OS X. If your operating system has a slow pthread-specific implementation but the compiler supports thread local storage, you can roll your own fast pthread specifics as shown above.

Thursday, May 20, 2010

Ambient lighting

Put these LED lighting strips in a clear water hose for ambient lighting project?

Secrets (everyone should know) About Running Web Servers

Datacenters are expanding, and a lot of them host web servers. It is imperative that web servers run fast and lean. An inefficient web server implementation can cost you money in many ways.
  • You will need beefier hardware; it is more expensive, draws more power, and generates a lot of heat (requiring higher cooling costs).
  • You will need more servers. In addition to all the disadvantages above, they take more room, which costs you more in rent.
Although network bandwidth charges can be expensive, they scale nicely with the amount of traffic you receive. There are still things you can do to halve bandwidth use, like compressing HTTP content. Other things, like combining packet output, can shave a few bytes off packet headers and improve latency, but these are not the main focus of this article. What I want to talk about is what can make running a web server unnecessarily costly, purely due to the software that runs on it.

First of all, PHP is expensive to run (i.e. Mark Zuckerberg is an idiot). In the very beginning, there was a web server that could serve static files. Then there was a web server that could serve dynamic content by spawning child processes and delegating requests to them using the CGI protocol. You can write CGI in any language, including C, C++, Perl, and even shell script. This satisfies applications for which Server Side Includes do not suffice. But for certain applications, the ability to embed code directly inside HTML makes the program look more elegant, because it saves you all the print statements. This is where PHP came along. It was first invoked by Apache over CGI. Each time you request a PHP page, the PHP interpreter has to parse and compile your PHP code from head to toe into bytecode before it can run.

As PHP gained popularity because of its ease of rapid prototyping, application sizes increased. It became impractical to parse from head to toe each time someone requests a PHP page, which nowadays can easily involve tens of megabytes worth of code: MediaWiki is 44MB, Joomla is 33MB, Wordpress is 9MB, PhpWiki is 1.4MB. This is where mod_php comes along. It avoids recompiling a massive amount of code by doing it only once and caching the bytecode in memory. Caching all the bytecode in memory seems like a small deal, but Apache distributes its workload over several child processes, each of which has a mod_php instance caching its own copy of the bytecode. If you increase the number of child processes, you run out of memory; if you limit memory usage, you cannot fully utilize CPU and I/O. This is explained better in Why mod_php is bad for performance. Unfortunately, memory is still the scarce resource. There is a limited amount of memory you can put into each server, so you will have to run more servers; they take up more space and cost more money.

Facebook recently developed a version of PHP called HipHop that precompiles your code. This alleviates the need for the mod_php hack, which trades memory usage for performance, a poor tradeoff (maybe Mark Zuckerberg isn't as dumb as I thought).

Now, back to CGI. I did mention that you can write CGI in any language you want. However, some languages have a pretty bad run-time startup cost (Java is particularly notorious, due to its Just-In-Time compilation). A solution is to keep the CGI child processes running, and hence the FastCGI protocol was born. This means that your application runs more or less persistently alongside the web server. Some languages, like Java and Python, simply have web servers written in the language itself, which require no bridging. FastCGI is just a clumsy way to bridge a particular piece of server software with dynamic site logic written in a different language, for administrative reasons, e.g. at a web hosting provider. If you can run your own web server, don't bother with FastCGI. For the rest of the article, assume you are writing a persistent web server from scratch.

The fact that your web application runs persistently on the server has several implications. If you write your application in C or C++ and it has a memory leak, it will keep accumulating memory until it runs out and dies. You can limit the effect of a memory leak by restarting each server process after a predetermined number of requests, keeping several processes active so that some are available to handle requests at any given time.

If you use a garbage collected language like C# or Java, you don't have to worry about memory leaks, but multi-threaded request handling is known to interfere with generational garbage collection by prematurely promoting would-be garbage into the tenured heap, which causes costly full collections to run more frequently. There are two solutions: increase the nursery size, or separate the application into several processes, each running only a few threads. You can also mix the two approaches.

This pretty much covers all I want to say. Here is a summary of secrets that I want everyone to know, about running web servers.
  • Stay away from mod_php (or mod_perl, mod_python, mod_ruby).
  • Stay away from web hosting providers that insist everyone use mod_php (or the other mod_*s): they are encouraging their customers to run more expensively, and they are often pricier themselves. That applies to 99.9% of the web hosting companies out there.
  • If you run your own web server, don't bother with FastCGI. If you can help it, use a web server that can be embedded inside your web application, written in your application's native language.
  • It still pays to write a web application in C/C++, especially where scalability is concerned.
Oh, and I almost forgot: if you can use client-side scripting, use it. Each individual user's computer may compare feebly with your server, but collectively their computing power is more than you can rent from a data center.

Tuesday, May 18, 2010

SSH from a jail

In FreeBSD, when you chroot into a jail, you cannot access /dev/tty anymore because the current terminal is not created inside the jail. You'd see something like this:
$ cat /dev/tty
cat: /dev/tty: Device busy
However, that also means you cannot ssh out of the jail, because ssh opens the terminal device when asking for a password. One way to work around this is public key authentication. However, there is a simpler alternative: an SSH_ASKPASS hack.

The ssh manual page states that if both DISPLAY and SSH_ASKPASS are set, the program specified by SSH_ASKPASS is run, and ssh reads the password from that program's standard output. Instead of prompting for a password, we write a script that echoes the password.
$ cat > ssh-login.sh
#!/bin/bash
echo open-sesame
^D
$ chmod +x ssh-login.sh
$ export DISPLAY=:0.0
$ SSH_ASKPASS=./ssh-login.sh ssh ...
Obviously, ssh will still not be able to access the terminal, so any SSH functionality that requires a pseudo-terminal will not work. You can still use non-interactive commands, SFTP, and port forwarding this way.

Wednesday, May 12, 2010

Abbreviating output to a line in the terminal

A very long time ago, I quickly whipped together a Perl script that reduces the standard output of another program to a single line, prefixed with an animation character and optionally suffixed with ellipses if the line is too long. It is useful when only the most recent output is significant and I only want to know whether the program is making progress. It can be applied to many commands, e.g.
  • tar zxvf source-1.1.0.tar.gz | abbrev
  • unzip archive.zip | abbrev
  • make | abbrev
  • tail -F /var/log/messages | abbrev
I've long wanted to rewrite that Perl script in C, and now here it is (abbrev.c). The program is licensed under GNU General Public License, version 3.

Compiling is simple if you're using GNU make:
make abbrev
You don't need to supply a Makefile; the implicit rule will work. You can optionally pass CFLAGS=-O3 to the make command, and optionally strip the executable after make. To install, just move the executable somewhere in your PATH. It should compile out of the box on Linux and Mac OS X.

There are some subtle features that one might not notice:
  • Output is throttled to a certain frame rate (default 30 fps).
  • Resizing the terminal window will cause the line length to adjust accordingly.
Hope you enjoy it.

Tuesday, May 11, 2010

Cost estimate for large project

A programming project of mine is growing in the number of lines of code. Although I write documentation as inline comments, I'm at a point where it is nice to have a reference. I've always known that open source projects tend to use Doxygen, which automatically parses your code and extracts documentation from it to be used as a reference. But this is not what I'm going to talk about today, though I'm using Doxygen as an example.

As I read about Doxygen, I noticed that the author offers some form of commercial support: implementing features, fixing bugs, and providing help to answer questions. I think it would be more useful if he could offer the service to retrofit a Doxygen configuration file for an existing project. In doing this, he would learn about large project requirements that he may build into Doxygen in the future. Another reason is that some people might not want to divert their attention to figure out how to configure Doxygen for their project, and they're willing to pay for the service (this is another plausible open source business model).

Most projects follow a particular convention, so setting up Doxygen is only a one-time effort. The nice thing about a computer is that, once the initial setup is done, the cost of regenerating the reference from source code is minuscule, a cost most project teams should be able to absorb. The expensive part is the setup, which requires human attention. This applies to any automation for a project, so I'm going to stop talking about Doxygen from now on.

Not all projects are created equal. It is simple to set up automation for a project that spans only a few thousand lines of code. A large project with millions of lines of code has a lot of hidden complexity. One source of the complexity is that collaborative effort accumulates bits of inconsistency, because everyone does things somewhat differently; even the same person changes the way he or she does things over time. In big-O notation, the amount of inconsistency is O(n), where n is the number of lines of code. In order to resolve these inconsistencies, we need to compare each inconsistency against the remaining ones (so we can devise a workaround or a fix), and the amount of effort required is O(n²). This is the cost of setting up automation for a project.

It's easy to see that it is most cost-effective to set up automation while the project is still small, since for a large project the cost estimate is proportional to the square of the project's size. A well-maintained project is tree-structured, and (without proof) a divide and conquer strategy could be used to resolve the inconsistencies that hinder automation, bringing the cost closer to O(n log n).

I think this very simple asymptotic analysis explains why money spent on multi-million dollar projects tends to vanish. A project involving n workers cannot allow each of them to collaborate with every other directly, since the communication complexity alone would be O(n²). An ideal, hierarchical, tree-structured organization still requires O(n log n) communication for time spent on meetings and planning. The cost of the actual work done is still O(n), so the total cost estimate is O(n log n) + O(n) = O(n log n). The cost estimate for a large project never scales linearly, and most directors under-estimate the cost if they use a linear model.

Friday, May 7, 2010

Multi-threaded allocator locking

Suppose we have an allocator that employs a headerless object layout strategy, like Streamflow. The free(p) operation needs a way to tell if p is managed by any of the subheaps, and if so, which subheap manages p. In essence,
void free(void *p) {
  SubHeap *h = GetSubHeap(p);
  if (h)
    h->Free(p);
}
We need a way to quickly look up the subheap that manages a pointer. The mechanism used in Streamflow is a byte vector, each byte corresponding to a page in memory, denoting whether a subheap is present and, if so, which kind (I imagine subheaps would span a power-of-two number of pages, naturally aligned to their size). This mechanism is attributed to Guy Steele, Jr.'s Big Bag of Pages (BIBOP).

On a 32-bit machine with a 4KB page size, the byte vector is 1MB. On a 64-bit machine, it would be a whopping 4096TB, unless we use a much bigger page size or make the table sparse, e.g. a multi-level hash table or some kind of binary search tree. Both pose scalability challenges in a multi-threaded setting, because concurrent lookup and update are non-trivial.

Inspired by how the Translation Lookaside Buffer (TLB) works, I postulate that most lookups could be satisfied from the thread's own lookaside table: a linearly probed hash table, keyed by the lookup address, with a small probe distance (say, 4). The idea is that if we can't find the entry within four iterations of a search, we might as well appeal to the global data structure. Hash table insertion may overwrite existing data. The policy balancing probe distance, eviction, and performance would be subject to further research.
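A sketch of such a per-thread lookaside table (all names, sizes, and the eviction policy are illustrative, not a definitive implementation; assumes 4KB pages):

```cpp
#include <stddef.h>
#include <stdint.h>

struct SubHeap;  // opaque; managed elsewhere by the allocator

// Per-thread cache: linear probing with probe distance 4, keyed by page
// number.  Zero-initialized TLS means all slots start empty (page == 0).
static const size_t kSlots = 256;          // power of two
static const size_t kProbeDistance = 4;
static const int kPageShift = 12;          // 4KB pages

struct CacheEntry {
  uintptr_t page;   // page number of the cached address; 0 means empty
  SubHeap *heap;
};
static __thread CacheEntry lookaside[kSlots];

static inline size_t Slot(uintptr_t page, size_t i) {
  return (page + i) & (kSlots - 1);        // masking doubles as the hash
}

// Thread-local, so no synchronization is needed.
SubHeap *LocalSubHeapLookup(void *p) {
  uintptr_t page = reinterpret_cast<uintptr_t>(p) >> kPageShift;
  for (size_t i = 0; i < kProbeDistance; ++i) {
    CacheEntry &e = lookaside[Slot(page, i)];
    if (e.page == page)
      return e.heap;
  }
  return NULL;  // give up quickly; the caller falls back to the global table
}

void LocalSubHeapInsert(void *p, SubHeap *h) {
  uintptr_t page = reinterpret_cast<uintptr_t>(p) >> kPageShift;
  // Prefer an empty or matching slot within the probe window; otherwise
  // evict the first slot.  Smarter eviction is a tuning question.
  for (size_t i = 0; i < kProbeDistance; ++i) {
    CacheEntry &e = lookaside[Slot(page, i)];
    if (e.page == 0 || e.page == page) {
      e.page = page;
      e.heap = h;
      return;
    }
  }
  lookaside[Slot(page, 0)].page = page;
  lookaside[Slot(page, 0)].heap = h;
}
```

A lookup miss costs at most four probes before deferring to the global structure, which keeps the common free(p) path lock-free.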

So we will have GetSubHeap(p) implemented like this:
SubHeap *GetSubHeap(void *p) {
  SubHeap *h;
  if ((h = LocalSubHeapLookup(p)) != NULL)
    return h;
  if ((h = GlobalSubHeapLookup(p)) != NULL) {
    LocalSubHeapInsert(p, h);
    return h;
  }
  return NULL;
}
LocalSubHeapLookup() and LocalSubHeapInsert() operate on thread-local data structures, so they do not need any synchronization. GlobalSubHeapLookup(), on the other hand, must acquire a read lock.

What might be of special interest is GlobalSubHeapRemove(): what if a cached subheap entry is removed? Ideally, removal happens only when the subheap is empty, which means the program should no longer hold references to any objects in that subheap; the entry in the lookaside table is simply left unused until it is evicted. However, many programs have double-free bugs. A double free would cause free(p') to find, from p', a subheap that has already been removed and call SubHeap::Free(p'), so the precise failure depends on what that function does. Another problem arises if the subheap is removed but another subheap is constructed at the same address: the double free would then corrupt the new subheap, again depending on how resiliently SubHeap::Free(p') handles corrupted pointers (the subheap might use a different object layout this time). Even if we keep the thread-local lookaside table synchronized with the global mapping, there may simply be no good way to handle a double free gracefully.