Life of a Computer Scientist: 2009

Wednesday, December 23, 2009

Creating a block device backed by Amazon S3

There is a lot of benefit using Amazon S3 as a network drive, namely it will be automatically replicated across many storage nodes, so chances are the files will survive pretty well. However, S3 comes with several limitations: files have to be 5GB or less, and files can only be uploaded as a whole. Partial download works. No partial upload is a consequence of the distributed nature of the S3 storage system.

Of course, a user mode file system like s3fs over FUSE can always elect to shard a large file into smaller S3 objects, say 4MB each. It would need to hide the individual chunk objects and present only one coherent file. While this might seem adequate, hiding objects means that some file names are not allowed.

Another issue is that S3 objects are stored inside buckets that have no directory structure. Properly supporting directory structure means storing directory listing meta-info as an object in the bucket as well.

What if the user wants files to be encrypted transparently?

These issues all signify that, in order to build a proper network drive on top of S3, we need to implement a lot of file system features from scratch. It would be much easier if we could store a file on S3 and use it as a block device. Let's go back to the drawing board and redo the sketch.

First, we'll have a simple sharded S3 file system over FUSE that can store large files as chunked objects.
We create a large file to be used as a loopback device, format, and mount it.

dd if=/dev/zero of=/mnt/s3/fs.img bs=1M seek=$((device_size_in_MB - 1)) count=1
mkfs.ext4 /mnt/s3/fs.img
mount -o loop /mnt/s3/fs.img /mnt/fs

We get all features of the filesystem (ext4 in this case), including all sorts of POSIX ACL goodies, for free. The operating system kernel even implements caching for us, so we can enjoy great performance!

This also works similarly on Mac OS X: just use Disk Utility to create a disk image.

If you want encryption, no problem.

dd if=/dev/zero of=/mnt/s3/crypt.img bs=1M seek=$((device_size_in_MB - 1)) count=1
losetup /dev/loop0 /mnt/s3/crypt.img
cryptsetup luksFormat /dev/loop0
cryptsetup luksOpen /dev/loop0
mkfs.ext4 /dev/loop0
mount /dev/loop0 /mnt/crypt

One problem with using S3 as a block device is that the image file size is fixed, but depending on the filesystem you use, it should be possible to resize the filesystem online. It might even be possible to use ZFS by adding block storage to the pool.

If you're not happy with the 4MB per chunk data transfer, you can easily reduce the transfer amount. Instead of mounting the block device image over S3, do the following:

Build an Amazon EC2 AMI to mount the disk image on S3 as a block device.
Make sure ssh and rsync are installed on the machine image.

Then you can do rsync to backup your local files with only very minimal incremental transfer to the Amazon EC2 cloud. The 4MB per chunk data transfer between EC2 and S3 (within the same region) is free and likely very fast. The instance itself only needs to be running when you start the backup, and can be terminated as soon as the backup is done.

Now, the only missing piece is a FUSE module for sharding large files into smaller objects on S3. Anyone wants to take on that?

Saturday, December 19, 2009

Wireless Doorbell Instant Messaging Bridge

The apartment building I live in never has a working intercom. It has something that resembles an intercom box, and it is supposed to dial a preset phone number when you push a button for each tenant, but the landlord never bothered to keep the presets updated. My visitors can call my cellphone, but I often cannot have UPS or FedEx send packages to home.

Other tenants also have the same problem. They would leave notes repeatedly asking UPS to just leave their package at the entrance hallway, but UPS never do that due to policy reason. Some tenants attempted to leave a phone number, but once I ran into the UPS delivery guy---he told me the company does not issue them corporate cellphones, and he never uses his personal cellphone for work.

Although there are wireless doorbell systems that are very cheap (~$30), I don't want to just install a doorbell for my own unit. I want to have the doorbell send an instant message that can be broadcasted to anyone who is interested, including other tenants of this apartment. The question is how to build a system like that.

The first iteration of this idea involves using wireless sensor network, but they require interfacing with a gateway, which must be connected to a computer for Internet connectivity. I want to leave the computer out of the picture. Besides, wireless sensor networks aren't cheap. I forgot the exact pricing range, but a gateway could cost $300-$600 or more, and each node is also $100-$200. And since many of these are still academic prototypes, you don't find many places to buy them. Someone could steal the doorbell, and it could cost me $200 to replace it. This seems like an expensive doorbell.

Then I wondered if a "system on a chip" would be a good idea. I found the PIC Micro Web, which is really a small computer with a parallel port and an ethernet port. I could solder a push button to the parallel port, which would fire something off the TCP/IP on the ethernet side. The price is reasonable. The only problem is, it only works on Ethernet. In search of similar units, I found Digi Connect ME, which has a sibling wireless model Digi Connect Wi-ME. This theoretically allows me to connect the door bell push button to my wireless home network and send instant message over my DSL. However, it's still a bit pricy, at $130 per unit.

Then it occurred to me that I could use a Wireless to Ethernet bridge which may or may not support Linux. I found Linksys WGA600N, which based on a Ubicom chipset that has a Linux based SDK. With that device, maybe I could find out which one is the serial port, and repurpose that for door bell button connection. The cost is $80, but I think it is acceptable.

Wednesday, December 16, 2009

ZFS on Amazon EC2 EBS

One of the Amazon EC2 advantage for customer like you and me is that you can rent a computer from their data center, and you're only charged what you use—computing hours, instance sizes, network transmission—with the exception of EBS where you are charged by the amount of storage you provisioned to a volume. For example, you're still charged 1TB of storage cost if you only used 1GB of that 1TB. Since EBS is just a block device, there is no way for Amazon to tell how much of it you actually used.

What happens if you grow out of the space you initially provisioned to a volume? You can expand it like this: detach the EBS volume, make a snapshot of the volume to S3, use that snapshot to create a new EBS volume of a greater size, and attach the new volume. You will likely need to use an operating system tool to resize the file system to the new volume size as well.

While this might work well for some people, I'm not entirely happy about this approach. The time it takes to make the snapshot is directly proportional to the amount of data you have stored. The down-time seems unavoidable.

I've been studying management of a ZFS pool, and here are my findings. You can create EBS volumes and add them to a pool. Then when you want more space, just create a new EBS volume and add them to the pool. The pool will enlarge automatically to reflect additional space.

The smallest EBS volume you can create is 1GB, but I don't suppose anyone would want to create lots of 1GB volumes. This will be a nightmare to keep track of. Fortunately, ZFS also allows us to replace smaller disks with larger disks. You can keep a pool with about 3-4 EBS volumes, and when you want more space, just create a new one with more space, and use it to replace the smallest disk in the pool. This way, only 1/3 or 1/4 of the data in the pool needs to be transfered. Furthermore, all ZFS pool operations are performed online, so there is no downtime.

What if you want ZFS mirror or ZFS raidz redundancy? I found out that, unless you're one of the lucky users of Solaris Express build 117, which provides autoexpand capability, the disks that are part of a mirror or a raidz are not automatically expanded even after all disks are replaced with larger disks. Such is the case for zfs-fuse on Linux. However, I found out that a zpool export followed by zpool import updates the size. Or you can reboot your computer. Again, the downtime now is the amount of time it takes to reboot your EC2 instance, and not the amount of time it takes to make snapshots, which is much better.

The disadvantage now, however, is that you can no longer snapshot the whole filesystem at once, which spans across multiple EBS volumes.

Wednesday, December 9, 2009

Supporting Math on Google Wave

This is not a software release, but I spent a few hours plowing through the documentation on Google Wave API, and I'm just noting down how it may be plausible to support LaTeX math equations on Google Wave.

The first aspect is rendering math equation. One essentially makes a Wave Gadget that decides how to turn LaTeX into presentable math equation. There are two possibilities: (1) use the undocumented Google Chart API, but the LaTeX equation length is extremely limited; (2) embed jsMath, and use it for math rendering. The disadvantage of the jsMath approach is that it is rather heavyweight, since each gadget has to run the same jsMath initialization code separately. A third option, which I don't consider a possibility, is to host NTS (a Java implementation of TeX) on AppEngine. The NTS implementation of full TeX stack (TeX to DVI, DVI to image) might be considered heavyweight, but the real challenge would be to workaround its reliance on a real filesystem, which makes it difficult to port to AppEngine sandboxed environment.

Once the rendering part is done, it may be convenient to convert LaTeX code to the rendering gadget on the fly. This can be done using a Wave Robot that listens for document changes and scan for LaTeX code in the blip. I think this is relatively straightforward. However, a concern is that each change requires the robot O(n) time to rescan the whole blip. Editing in Wave would then become noticeably slower as document size grows. This scalability issue must be addressed. I think the current workaround is to have Wave only contact the robot sparingly. This also means robot updates are going to be slow. It may be possible to integrate everything better in the future, by processing update events directly in the Wave Client without using a robot.

Wednesday, November 11, 2009

Emulating C++0x auto keyword in C++

GCC has a typeof(e) operator (more pedantically, __typeof__(e), as typeof is not in C99 nor C++89 standard) that can compute the type of any expression e. It's typically used by macros, whose arguments aren't typed, to create local variables that has the same type as its arguments. For example,

#define max(a, b) \
  ({ __typeof__(a) _a = (a), \
     __typeof__(b) _b = (b), \
     (_a >= _b)? _a : _b })

This definition of max(a, b) prevents evaluating the expressions a and b more than once. These expressions can cause side-effects, such as printing a message to the screen.

The typeof operator apparently also works for C++. It is sometimes awkward to write out the type of the whole expression. For example,

void foo() {
  std::vector<int> v;
  for (std::vector<int>::const_iterator iter = v.begin();
       iter != v.end();
       iter++) {
    /* ... */
  }
}

Writing the type for the iterator is cumbersome. In C++0x, this code can be written like this using the auto keyword.

void foo() {
  std::vector<int> v;
  for (auto iter = v.begin(); iter != v.end(); iter++) {
    /* ... */
  }
}

In C++, using the __typeof__ keyword, we can write a macro like this:

#define autovar(x, exp) __typeof__(exp) x = (exp)

And the code above would look like this:

void foo() {
  std::vector<int> v;
  for (autovar(iter, v.begin()); iter != v.end(); iter++) {
    /* ... */
  }
}

Here is another use of the typeof operator: sometimes C++ can't figure out how to instantiate a template, and manually writing the type arguments can be tedious. You can use __typeof__ to help the compiler figure out the correct template parameter.

Apparently Boost provides a BOOST_AUTO macro that uses pretty much the same technique.

In C++0x, the typeof keyword is called decltype.

Tuesday, September 22, 2009

Google Chrome (Chromium) for RHEL (CentOS) 5

For a long time, I've kept a Windows machine for Remote Desktop access that I use just so that I can use Google Chrome. The reason is that editing Google Sites pages, for example, drags Firefox 3.5 and Opera 10 to a slow crawl. And Firefox 3.5 on Linux (since beta2) crashes my X11 server because the video driver (xorg-intel) is old and buggy. But I've long wished to abandon Windows altogether. I share the Windows machine with another graduate student, and the Windows Server 2003 license for that computer restricts two remote desktop connections. This is surely absurd for me as a Linux user because I've along accustomed to unlimited number of connections (subject only to resource availability as opposed to artificial license restriction). Granted, I could have used VNC, but it's not as responsive.

It is currently not my best interest to figure out how to compile Chrome from scratch (they appear to be using some gclient custom build tools). However, the precompiled binary snapshots require more recent versions of FreeType, Gtk+, and NSS (network security service, as part of Mozilla) than that available on BU Linux (which is a RHEL/CentOS 5 derivative).

Compiling Gtk+ is a nightmare because of the sheer number of dependencies. Here I'm listing the packages I need. The listing format is "target: dependencies"

gtk+: atk glib cairo pango jasper fontconfig
pango: cairo freetype glib
cairo: libpng freetype pixman fontconfig
fontconfig: freetype
atk: glib
glib:
libpng:
freetype:
pixman:
jasper:

The order listed above is in reverse topological order. I just put them in a Makefile and let make figure out the order for me. Fortunately, these packages are standard GNU autoconf/automake based. You will have to set export variables as follows:

PREFIX := ... # you pick a convenient location
export CFLAGS := ... # you pick the compilation options, e.g. -march=i686 -O3
export CPPFLAGS := -I$(PREFIX)/include
export LDFLAGS := -L$(PREFIX)/lib
export PKG_CONFIG_LIBDIR := $(PREFIX)/lib/pkgconfig
export PATH := $(PREFIX)/bin:$(PATH)
export LD_LIBRARY_PATH := $(PREFIX)/lib

You will have to run configure with --prefix=$(PREFIX), and then run make and make install inside your Makefile.

I also wrote a script to essentially do the following. This will be left as an exercise to the reader.

Given a target name like gtk+, locate the source package tar.gz and unpack it.
Make a new object files directory and run the configure script there as the current working directory. If the build fails, you can just remove the object directory to get a clean slate, instead of needing to unpack the source files again.
Do the traditional make and make install.
Touch the target as a file, so make doesn't have to repeat building and installing a target if it already succeeded.

Each rule in the Makefile would look like this:

# Assuming the build script is called build.sh in the current directory.
BUILD := ./build.sh

gtk+: export LDFLAGS := -lm  # for jasper
gtk+: atk glib cairo pango jasper fontconfig
$(BUILD) $@

On the other hand, NSS has its own build instruction, so I did it manually. The resulting nss shared objects must be symlinked in a particular way:

Add ".1d" suffix: libnss3, libnssutil3, libsmime3, libssl3.
Add ".0d" suffix: libplds4, libplc4, libnspr4.

I imagine that's Ubuntu/Debian convention.

I also compiled my own gcc 4.3/4.4, and without a recent gcc and libstdc++.so.6, you will get a dynamic linking error with the symbol GLIBCXX_X_Y_Z.

Tuesday, September 15, 2009

Benchmark Using High Resolution Clock

POSIX 1993 real time extension defines an interface to obtain clock tick down to nanosecond precision, using clock_gettime(). One way to implement this is to use Time Stamp Counter (TSC) that comes with the CPU, which is incremented by one per CPU clock cycle. On a modern gigahertz CPU, the TSC increments several times per nanosecond, so it can provide sub-nanosecond resolution.

Then, there comes CPU with power saving mode, which reduces the CPU frequency when underutilized. Some older CPUs increment TSC by one per cycle regardless of the frequency, so each TSC increment is not uniform (though still monotonic). For these CPUs, we can get a reliable count of instruction cycles, but we cannot use the TSC to track wall-clock time.

Later CPUs increment TSC by one according to the maximum CPU frequency, so the TSC increment takes uniform interval. However, when benchmarking code, the TSC no longer measures actual clock cycle. Code that runs in reduced CPU frequency will take more TSC counts to execute. A machine to be used for benchmarking purpose should turn off power saving mode. In particular, a laptop (having more stringent power requirement) is probably not well-suited for benchmarking.

There is also the issue of multi-processor machines. Each CPU may have slightly different frequency, and they're started at slightly different times at boot up. The TSC of each CPUs cannot be easily correlated. Even if a benchmark involves only a single process, the operating system might still migrate the process to another CPU, causing the TSC readout to be non-deterministic. On a multi-process machine, benchmark should be run with CPU affinity fixed on one CPU, using sched_setaffinity(). CPU affinity is desired also for cache and TLB reasons.

I do not currently know of a reliable way to benchmark multi-threaded program down to nanosecond precision. Here are some problems I could think of.

sched_setaffinity() works on a pid_t (i.e. the whole process). In general, pthread_t and pid_t are not interchangeable.
If we want to use one CPU as the master clock source, the intercommunication and synchronization overhead will be non-negligible. It should be later verified, but I'm guessing that the resolution would be ~100s nanoseconds.

Saturday, September 12, 2009

Compiling LyX with gcc 4.4

I recently upgraded to gcc 4.4.1, and today I found that I couldn't compile LyX without fixing the source code. The reason is because gcc 4.4 has a stricter interpretation of the C standard when it comes to preprocessor syntax. The #if ... #elif ... #endif block is now either all processed or all skipped. In other words, #elif does not behave like #else #if ... #endif anymore because the conditional expression of #elif no longer enjoys deferred computation. This is documented in gcc bug #36453.

LyX failed to compile because it distributed and included boost/mpl as part of the source code. A patch for boost/mpl is readily available. I downloaded the most recent patch (0001-boost.mpl-gcc-4.4-fixes.patch), saved it under $(SRCDIR)/boost (suppose $(SRCDIR) is the source directory for LyX, e.g. lyx-1.6.4), and ran from $(SRCDIR)/boost the command: patch -p3 < 0001-boost.mpl-gcc-4.4-fixes.patch

LyX should now compile correctly under gcc 4.4

Sunday, September 6, 2009

Abuse of read/write lock

A more fine-grained alternative to mutex lock is the read/write lock (rwlock). Mutex disallows two threads to access the same variable at the same time regardless whether the access is a read or write. With rwlock, the basic principle is simple: two readers can read from a variable (memory location) simultaneously without stepping on each other's toes. However, two writers cannot both write to the same variable, nor can one writer and one reader try to access the variable at the same time; both lead to inconsistent results.

There is one useful way to correctly abuse rwlock. Before we introduce the idea, let's talk a bit about lock-free stack implementation using singly linked list. Using compare and swap instruction, stack push can be implemented in the lock-free (and wait-free) manner. The compare and swap instruction provides a sufficient one-word transactional memory mechanism to allow changes made by another thread to be detected. This works because the node to be pushed to the stack is supposedly private. But stack pop suffers the complication of the ABA problem because the node to be popped is shared, so while thread 1 is in the middle of popping node A—let's say its successor is X at the moment—thread 2 pops node A, pushes node B, and pushes node A back. Now, A's successor is B instead of X. Thread 1 may mistakenly put reference to X as the top of the stack when it finishes popping node A.

With the singly linked list stack with push and pop implemented using compare and swap, we have the following access granting relationship. Two threads can safely push at the same time (guarded by compare and swap), but no two threads may pop at the same time, nor one thread to push and another thread to pop at the same time. This is exactly like a rwlock!

For this particular data structure, we could use rwlock to protect access. It is not exactly lock-free wait-free anymore, but maybe there are cases where the compromise, which is much easier to implement, is acceptable and useful.

Thursday, September 3, 2009

Migration to Gmail

Two days ago I successfully migrated two e-mail accounts to Gmail, one university e-mail and one department e-mail. Both of them are Unix based, and the IMAP server stores all mails except the INBOX in my home directory. Here is what I did.

Create a new Gmail account. It will hold both the university and department e-mails.
In my home directory, create a .forward file that contains the new Gmail e-mail address, and remove (or rename) the .procmailrc file.
Enable IMAP in Gmail.
Migrate old e-mails (I have about one year's worth kept in the mailbox) using imapsync. The university account runs UW imapd, and the department runs Dovecot. Here are the specific arguments to imapsync (besides login credentials) and why I needed them.

--folder INBOX (to specify the inbox folder)
--folderrec mail (this is the directory where all the mails are stored in Pine)
--prefix1 mail/ (mailboxes with this prefix are stripped by imapsync)
--regextrans2 's,Sent Messages,[Gmail]/Sent Mail,g' (I had used Apple Mail with these accounts too, and Apple Mail saves the sent e-mail in the Sent Messages folder, so I configured pine to do the same; now I want these to go under Gmail's Sent Mail folder)
--regextrans2 's,spams,[Gmail]/Spam,g' (I kept a separate spams folder; could have removed this folder before the migration).
I should have done this too: --regextrans2 's,INBOX,label-name,g' (the INBOX of each account goes under a different label). I simply did this manually after the fact which was a lot of work.

imapsync prior to 1.288 wants to download the list of all folders regardless whether you need them or not. This means the imap server will traverse all the files in my home directory, and this takes a lot of time. Version 1.288 solves this problem by only enumerating all folders when you use the --include option. I temporarily changed line 915 in imapsync as follows:
```
-my @all_source_folders = sort $from->folders();
+my @all_source_folders = sort $from->folders($prefix1);
```
When adding a new From e-mail in the Gmail account, they offer to setup an SMTP server. This eliminates the "From: yourname@gmail.com on behalf of youroldemail@example.com" message shown in Outlook.
I configured pine to access Gmail over IMAP as follows. I edited .pinerc to change the following settings:

inbox-path={imap.gmail.com:993/ssl/novalidate-cert/user=yourname@gmail.com}label-name (For each old e-mail I migrated, I tagged them with a dedicated label, so the inbox for Pine would show what was there before)
folder-collections=Main {imap.gmail.com:993/ssl/novalidate-cert/user=yourname@gmail.com}[] (This is so I can see e-mails under other labels)
default-fcc={imap.gmail.com:993/ssl/novalidate-cert/user=yourname@gmail.com}[Gmail]/Sent Mail
postponed-folder={imap.gmail.com:993/ssl/novalidate-cert/user=yourname@gmail.com}[Gmail]/Drafts

Lastly, I created filters to tag and archive incoming e-mails and then used the multiple inbox Gmail lab feature to be able to see e-mail coming in from both accounts at once.

Sunday, August 30, 2009

Analysis of Darwin kernel exploit

This is an addendum account of The Story of a Simple and Dangerous Kernel Bug.

On Unix, the fcntl() system call is used to manipulate file descriptors (duplication, close-on-exec, access mode, ownership, and locking). On Mac OS X, it is also used to provide read-ahead and write-back hints, and therefore some of the commands are file-system specific. The file-system specific commands are passed to the device driver as if an ioctl called has been made; on Unix, ioctl() allows the device driver to export any method call (associated to a file descriptor) to the userland.

The problem with passing fcntl() arguments to ioctl() in Mac OS X is that ioctl() and fcntl() checks their arguments differently. The argument depends on which command code is passed to the function. If the argument is a pointer, the pointer should be checked that it points to valid memory. However, fcntl() argument checking code does not recognize ioctl commands, so it just assumes that the arguments are integers. This allows unchecked pointers to be passed to device driver code, which can overwrite arbitrary memory address.

Wednesday, August 26, 2009

Summary of 32-bit Windows 3GB memory limit

Just read this article yesterday about how the memory limit is a licensing restriction more than a technical restriction, and thought I would summarize it below.

x86 hardware reserves the physical address space between 3GB to 4GB for PCI memory mapped I/O (i.e. talking to devices the same way we access memory), so this part of the RAM is hidden by design of hardware architecture on the motherboard.
Some motherboard hardware supports 36-bit addressing on the bus, and the RAM between the 3GB to 4GB range is remapped to a physical address higher than what 32-bit physical address can access.
Intel CPU after Pentium Pro comes with Physical Address Extension that allows a 32-bit virtual address to be mapped to a 36-bit physical address.

This boils down to:

If the motherboard supports 36-bit address and remaps the shadowed RAM between 3GB--4GB range, then an operating system (such as Linux, Mac OS X, Solaris) could use that RAM using Physical Address Extension, as well as any RAM beyond 4GB.

However, the licensing restriction on Windows works like this:

Prior to Windows XP SP2, there is no licensing restriction, so on a capable hardware, 32-bit Windows XP could use as much physical memory as supported by the hardware.
Due to licensing restriction, the kernel in and after Windows XP SP2 ignores RAM located at physical address space beyond 32-bit, effectively capping usable RAM to 3GB even if the hardware has 36-bit physical address and remaps the 3GB--4GB RAM.
This licensing restriction is lifted on 32-bit versions of Windows Server and Windows Datacenter, as well as 64-bit Windows. These versions of Windows can access all available RAM supported by the hardware.
Windows Vista SP2 now reports the amount of RAM installed on the system, not the amount of RAM used by the operating system.

Friday, August 14, 2009

Unit testing for C++ "concept"

When you gain the ability to do template generic programming like this:

template<typename T>
class MyClass {
// use T somewhere
};

It falls naturally that you expect T to provide certain functionality. For example, many people write code like this:

template<typename T>
class MyClass {
public:
T Add(T x, T y) { return x + y; };
};

This assumes that T has a well-defined operator+. It might be more precise to write this:

template<class T /* implements Addition */>
class MyClass {
public:
T Add(T x, T y) { return x + y; };
};

where the interface Addition could be specified like this:

template<typename T>
class Addition {
public:
T operator+(T that)();
};

This is to say that the T you use to instantiate MyClass cannot be any type, but only a type that has implemented addition. The concept of addition is not very interesting, so let's use another example.

Let's say you have the concept of a map<K, V> that is an associative map from keys of type K to values of type V. For the sake of testing, you can assume that K is both hashable and comparable (using an integer key will do). Then you are given two implementations of the map<K, V> concept, one implements the binary search tree sorted by K, and one implements hash table hashed by K. Most of the tests you want to do have to do with the concept of map<K, V> itself, e.g. insert, replace, find, and remove; not to do with the particular way the map is implemented. How do you avoid duplication? Use Type-Parameterized Tests.

However, here is a tricky bit. A concept does not describe how a class is constructed or destroyed, only that the class implements certain methods. What if the class does not have a default constructor? For example, the hash table requires an initial size. Most implementations of a hash table supplies a "good enough" default value for the default constructor, but there are cases where this is not applicable.

I figured out that you could parameterize the test using a factory class instead, and the factory would act like a default constructor. This is illustrated here.

However, this is not as straightforward as I would like because when a template class inherit from the base class which is also a template, the derived class does not have access to any of the protected variables! This is illustrated here. This code compiles, but if you replace the self->x_ and self->y_ references with simply x_ and y_, gcc would fail to compile.

Update: apparently gcc is correct, and the workaround is to simply prefix the member access with "this->."

Thursday, August 13, 2009

Unit testing not a good fit for exhaustiveness

So I implemented a layered memory allocator from scratch, somewhat in the fashion of Hoard (mine has a different abstraction). My allocator heavily uses C++ Mixin and Curiously Recurring Template Pattern (CRTP). Mixin is an overloaded term, but the concept I'm citing here is a class whose superclass is a template parameter. This allows a limited form of aspect-oriented programming where the Mixin is an added aspect of an existing class. CRPT is a recursive use of Mixin, which coerces the inheritance hierarchy of the user and the Mixin to form one class. The result allows static dispatch of methods which would otherwise have to be declared virtual, and subsequently dispatched dynamically using a virtual table.

Anyway, long story short. After creating many composable layers of memory allocator, I also created corresponding unit test for each layer. However, the unit tests only covered the best case scenario, that Alloc() never returns NULL. It worked well. But I also wanted to test the case where some allocation returns NULL, in order to tell if the layer handled NULL pointer propagation correctly. This unit test scenario turns out to be much of a nuisance.

To briefly explain why, no allocator always deterministically returns a valid pointer, nor does it always deterministically returns NULL, and it's the handling of the non-determinism that matters. Constructing a mock allocator that always returns NULL would still bypass some code paths. To make sure we traverse all code paths, we would have to construct an allocator that returns NULL after n successes for an arbitrary n, and run the unit tests for n = 1, 2, ... and so on. It may never exhaust all possibilities for a complex allocator layer such as the free list allocator. But then, what about a code path that is not reached unless the allocator succeeds every 2 attempts, or one that succeeds every 3 attempts, and so on?

NULL pointer is a special form of the 'a option datatype in ML. The concept is really simple. The 'a option datatype has two constructors, SOME which carries an item of type 'a, and NONE which doesn't carry an item. When a pointer is not NULL, it is assumed that it refers to a valid object, hence corresponds to SOME; when a pointer is NULL, it is assumed that no such object exists (typically due to an exception), hence corresponds to NONE. When you pattern match against the constructors of a datatype, ML also performs exhaustive analysis. Forgetting to check for NULL is a kind of non-exhaustive pattern matching. Since exhaustiveness is a local property, using unit test (which tests for a property that is much wider in scope) is overkill and ineffective. It is very hard to create a unit test that will run all possible execution paths, especially for the error handling code, because simulating all possible combinations of error condition is impractical.

Although a lot of times it's easier to write unit test for operations that are in-spec (e.g. to make sure a sorting algorithm permutes the sequence in order), it is best to leave exhaustive analysis to a compiler, like ATS, which effectively deals with out-of-spec situations.

Tuesday, August 11, 2009

The coming of virtual appliances

Everyone knows that web 2.0 has revolutionized people's private lives, with online social and entertainment applications like Facebook and YouTube. However, web 2.0 so far only has limited success for business applications in offices where the real money is at. While people are willing to entrust personal data to web 2.0, businesses can't do the same about their critical business data. Even well-known brand name like Google is fighting a hard fight selling the Apps for Domain service.

The business concerns are manifold. Privacy: do I get to control who has access to my data? Safety: once data is stored, will it persist and out-survive any forms of failure? Liability: is there someone I can sue and recover some damage when worse comes to worst? Being a business model still in its infancy, many web 2.0 companies have terms of service that are against all these concerns. Only after a while do people realize that these are non-issues with certain brand name companies simply because of the sheer number of user base and the lack of negative publicity.

Digression. In fact, some negative publicity became confidence boosters. When lightning stroke one of Amazon's data centers for EC2, it brought down a rack of machines. All their customers had to was to restart the instances on other machines that are still working. They lost only the temporary instance data that have not been committed to permanent storage, the same kind of loss if you unplug the power to your servers. It took only a few minutes to resume normal operation.

It is different if you are a small web 2.0 startup that wants to focus on business customers. How do you build a customer base when nobody wants to trust their data with you, and how do you build trust without a customer base? Furthermore, the remoteness of the web means it's more difficult to hold you physically accountable, so that is one more reason not to trust you. This is an obvious reason why web 2.0 business-oriented applications failed to flourish so far.

Also, as a small web 2.0 startup, you want to focus on developing the application, not on the plumbing. You can go back to pre-Internet model, selling software on a disc and have your customers install it on their own machines. However, some web 2.0 applications are built on top of many dependencies (web server, database, programming language runtime) as well as being intricately integrated with the operating system (disk access policy, user privilege, network firewall rules, performance tuning). It might not be your customer's best interest to run IT in-house either.

Cloud computing is a solution to this problem. Here comes a brand name utility provider like Amazon with a service like EC2 that you will more likely trust. And a third-party software company with a web 2.0 application you need. They give you a stateless virtual machine image (such as AMI) and charge you a license fee for using their software. Amazon provides the hardware, storage, and network infrastructure, and bills you for the operational costs. All the costs are fully transparent: license fee for the development and support of the application, and Amazon for only what you use.

When it comes to a customer who has built an infrastructure around VMware technology, the third-party simply rereleases their web 2.0 application as a VMware virtual appliance. It is easy to adapt to another virtualization technology because the only difference is the tool used for packaging a virtual appliance; the software stack---the operating system and all the dependencies---stay very much the same.

Once the web 2.0 company grows large enough, they could consider selling hardware appliances; or if they're willing, provide a hosted solution for their apps when someone asks for it.

To summarize, cloud computing with virtual appliances is going to be a great platform for business oriented web 2.0 technology, particularly lowering barrier to enter market due to lack of trust of critical data. And now, web 2.0 can finally enter the business sector.

Friday, July 24, 2009

Splay trees with duplicate items

I've implemented a splay tree, and I'd like it to store duplicate items, which are key-value mappings, as follows: if a key to be inserted already exists, then the new binding shadows the old one. If a key is to be removed, then the old binding is revealed. In other words, the inserting and removing the same key reveals the value in LIFO, or stack, order.
The most obvious solution is to have each node be a pointer to a singly linked list, with the cons/uncons operation trivially obeying stack order. But we could embed the singly linked list in a simple binary search tree if we relaxed the criteria a bit, for example making the left child <= parent as opposed to left child < parent. This effectively makes duplicate nodes a chain of singly linked list down the left path. The idea is that we always reach the parent first, which is the most recently inserted item. And if the parent is removed, it reveals the second next item of the same key.
The complication with Splay tree is that it could rotate the tree, which breaks the chain of duplicate nodes. Sometimes it is as bad as splitting the chain to left and right children to a node != the key of the chain, invalidating the binary search tree invariant. I figured out that, in the zig-zig and zig-zag cases, if the grand-parent node has the same key as the parent node, then it is possible to come up with special case rotation that preserve the chain. However, the terminal case actually needs to descend down the whole chain of duplicate notes. At this moment I gave up.
Also, another problem with embedding singly linked list in a tree is that it will increase the depth of the tree, hurting the overall running time. In this case, it seems that the simple approach, making each node a pointer to a singly linked list, is better. It only took me 5 minutes to change the implementation this way and it worked.
Update (9/15): Wikipedia claims that identical keys in a splay tree are preserved in its order like stable sort. It suffices to find the leftmost or rightmost node of the identical key. I haven't tried this.
Update (9/16): Here is the Wikipedia edit that added the aforementioned claim about stable sort. However, I don't think retrieving duplicate item is as simple as finding the leftmost or rightmost node. Even if insertion is stable, the duplicate nodes may not be continuous (e.g. the balanced binary search tree for 1, 2, 3, 3, 3, 4, 5, where the 3's are separated by nodes 2 and 4). Looking for the "largest" duplicate complicates the search by a great deal because you will have to probe either the minimal of the right child, or the maximal of the left child, and may have to back-track. Curiously, Splaysort did not mention how they deal with duplicates.

Wednesday, July 22, 2009

Captain's Log on Quest

Okay, not quite captain's log.
My office mate has been working on this operating system called Quest, and I'm keeping a record for him about what happened. He had been testing the OS on a virtual machine, and yesterday he decided to try it on real hardware. He found a reasonably recent hardware with APIC support in the lab. He would burn a CD-R, run upstairs, something goes wrong, write down some diagnostic on a piece of paper, run downstairs back to the office, change some code, burn a new CD-R, and run upstairs.
I hate to see CD-Rs being thrown into trash at an incredible rate. After all, CD-Rs are supposed to last for 100 years; whether they still preserve any data or not at that point is an entirely different question. Since each image is small (~2MB), I suggested burning multi-session CDs instead. He tried but couldn't get it to boot. I don't have any other ideas.
Today he bought some CD-RWs, which is an improvement. He also brought his laptop and stayed in the lab most of the time, which cuts the turnaround time a lot. When I visited him, he showed me a strange error that he fixed.
There is a static char arena[1000000]; in the code which is used for rudimentary memory allocation. He found out that if he did not initialize this arena to zero, then the code that initializes this memory with free list pointers would have page fault, accessing memory at some outrageous location (this is typical of array out of bounds problem somewhere else, polluting the arena).
I told him to try to memset to 0x5A instead. It will give us a clear indicator of what's wrong. He burned the CD-RW (had to erase the whole disk first) and booted it up. Lo and behold! The whole text-mode screen was then filled with bright green Z with pink background (thank goodness I didn't suggest 0xA5, or it would be filled with *blinking* purple Ñ on green). It seemed that the arena overlapped with text mode video buffer.
Another lesson learned about writing operating systems: double check special hardware physical address mappings.

Tuesday, July 14, 2009

Manifesto of a Very Hard To Find Memory Leak (To Be)

I'm implementing a hash table that will become the basis of a free list memory allocator. The hash table uses singly linked list for buckets. In the list implementation, there is a RemoveAll function written like this:

void RemoveAll(Predicate& p, Consume& c)
{
Node **result;
for (result = &first_; *result != NULL; result = &(*result)->next_) {
if (p(**result)) {
Node *node = *result;
*result = node->next_;
node->next_ = NULL;
c(node);

if (*result == NULL)
break;
}
}
}

Admittedly, the code is copied from a similar function called Remove(), which removes only the first node that satisfies the predicate. However, when it is modified to remove all nodes, an unintended side effect is, if a node is to be removed, it always skips the next node. When I wrote a unit test, I inserted nodes for 0 through 16, and removed all the even nodes, so this problem was not detected. However, when I used this function with a predicate that always returns true, I noticed that the list is not cleared properly.
Now, this function is used in the hash table to clear all the buckets. Each item in the bucket would be a free list for a certain size of objects. When a free list memory allocator uses this hash table, it will see memory leak on a certain size of objects, and the size would appear to be at random.
Fortunately, when a list is destructed, I have it do nothing but to assert that the list is empty. This helped detecting the problem early. Unit test for the hash table to clear all entries helped bringing this problem to the surface.

Hash table using a variety of things for bucket

A hash table stores key-value mapping of data in an array of buckets. The bucket number for a particular key-value pair is decided by a hash of the key modulo the number of buckets. The idea is that if the average number of key-value pairs in buckets is low overall, then we can achieve constant lookup time. One way to organize the bucket is by using a singly linked list. This has the side effect that, if the buckets are overloaded, then lookup time approaches the time of the singly linked list, which is O(n).

An obvious improvement is to make the bucket a binary search tree, perhaps a splay tree (so that item being looked up more often is quicker to find). This cuts down the worst case time of overloading to O(logn), while achieving average case lookup time of O(1).

We can nest hash tables, so a bucket can be implemented by another hash table. In this case, if the load of buckets are small, then we can let two or more buckets initially share a hash table. As buckets grow in size, they split. The size of nested hash tables must be relatively prime, or the inner tables would have high collision rate and won't be effective. The hashing scheme can be organized in the fashion of radix sort where the outer table use one hash function on the most significant bits, and the inner table use another hash function on the least significant bits.

Tuesday, June 30, 2009

Kernel page tables

My office mate has been working on an operating system kernel, and apparently page table manipulation is very error prone. Here are some important things that are easy to forget:

Special memory I/O pages cannot be freed into the general page pool, such as the physical address used for I/O APIC. If it happens, then another program would allocate the memory I/O page and then write to that physical address as if it's RAM.
If two address spaces share a page (mmap w/ MAP_SHARED, or shmat), then pages need to be reference counted.
If two processes share the same page table (clone system call), then page tables need to be reference counted.

Since he's too engaged in debugging, I'm writing these down for him in case he forgets. I said to him, maybe the reason why people haven't made ground-breaking research into redesigning operating system architecture because the moment you walk in, you get lost in the woods, and you don't see the forest anymore.

Wednesday, June 17, 2009

Super-Sized Del.icio.us, or Down-Sized Wayback Machine

The problem with URL bookmarks is that they can always become dead links. Here are some of the common reasons:

It takes people and resources to keep web servers running, and the company or institution that maintains the web site just went away, bringing their web site with them.
The page disappears because the webmaster no longer deems the page worthy of keeping. It could happen when a web site is revamped, so some pages just disappear while others have their URL changed to fit the new site structure.
Pages can be moved around as part of a minor reorganization effort.
Content agreement mandates that the page is only accessible for a certain period of time.

Furthermore, even if the page still exists, it may have been edited. The information that you wanted to be bookmarked may no longer exist. The Internet Archive Wayback Machine solves part of this problem by crawling a given website constantly at different dates, and by allowing you to retrieve a version at an earlier date. However, it has its own problems:

It only does best-effort crawling. In particular, it may take up to 6 months for the URL you submitted to show up.
It is a web crawler, so it voluntarily observes robots.txt.
The web page it archives is often missing images and other constituent material.
It has to rewrite hyperlinks in the web page.
It won't work with Javascript and XMLHttpRequest (AJAX).

Furthermore, the Wayback Machine often crawls redundantly and unpredictably. It may attempt to crawl other parts of the website that you're not interested in, omitting the part you actually want. It may crawl the URL when you don't need it, and not crawl when you do.
Here I propose a service that takes a complete snapshot of a web page in the manner of a bookmark. This will be an immensely helpful research tool for historians and other people who need to keep a small piece of artifect for a web page they're interested in.
Web browsers have had the ability to cache web pages on disk, so static content doesn't have to be redownloaded everytime the web page needs to be rendered for display. The idea is to build a more sophisticated caching feature on top of a web browser.

All HTTP connections are memoized by the snapshot, so Javascript and even some Flash content will work.
Support multiple snapshots for the same URL at different time and be able to display them side by side for comparison. This is not something a proxy server can do.
Snapshots are searchable.
Schedule regular snapshots or prefetch web page.

Possible advanced features:

Store snapshot on external server. Pro: can consolidate snapshot of the same webpage taken by two different people, if they are the same. Pro: can allow public access to snapshots. Pro: high availability because customer doesn't have to manage their own storage. Pro: high mobility because snapshots live on "the cloud." Pro: centralized search and indexing. Con: will cost money to run storage server infrastructure, so service will not be free. Con: not likely supported by ads revenue.
Allow public access to the snapshots. Pro: any website can link to a snapshot to avoid the external dead link problem. Con: will run into copyright issue when the original web site owner determines he is losing profit because of the snapshot.

Technologies that can be leveraged:

QT for cross-platform user interface. Doesn't require commercial license; LGPL is fine.
WebKit (available in LGPL) or Presto (commercial proprietary license) layout engine for rendering web pages.

Saturday, June 6, 2009

Multiresolution NLE video codec

Highly compressed video codec such as H.264 is often too CPU intensive for real-time editing. Their inter-frame compression also makes frame-to-frame editing difficult. A typical workflow to work with these videos is to transcode them to an intermediate codec, which uses lower complexity, intra-frame only compression, so that non-linear video editing software can manipulate the video frame by frame in real time. Intermediate codec typically compresses at a much higher bitrate to make up for lower complexity (though still more efficient than uncompressed video); as a result, it requires higher disk bandwidth.
In order to work with computers with slow disk (i.e. laptop), sometimes it is desirable to edit with reduced resolution video. When the editing finalizes, rendering is done using full-resolution footage. However, during part of the workflow, it may be desirable to use full-resolution (e.g. compositing), but other times half or quarter resolution (timeline editing). One would transcode into the intermediate codec in multiple resolutions, which is a waste of disk space and a headache to manage.
Two popular intermediate codecs are Apple ProRes 422 and Avid DNxHD, and neither of them tackles this issue. Here is my idea. A plausible construction of an intermediate codec is to just JPEG encode video frame by frame, so I'll use that to illustrate. We can encode a frame progressively as follows.

Given a video frame, produce ¼×¼ and ½×½ resolution images.
Encode ¼×¼ image using JPEG, decode it, and scale it up by 2x (i.e. to ½×½ the original size). Make a difference image from this one and the original ½×½ image.
Encode the ½×½ difference image using JPEG, decode it, and scale it up by 2x (i.e. to the original size). Make a difference image from this one and the original image.
Encode the original-sized difference image using JPEG.

This allows quarter resolution, half resolution, and full resolution reconstruction of the video stream. Data chunks in the stream can be arranged so that if we want to decode lower resolution picture, we don't have to read the higher resolution data.
Some issue that may not work well:

JPEG causes compression artifect that is eccentuated by the layered encoding scheme.
Total decoding time for full resolution is tripled (bottleneck being memory bandwidth rather than ALU).

Update (6/23/2010): apparently JPEG "progressive encoding," which allows an image to be loaded with gradually finer details (not the same as progressive vs. interlaced frame format), may pave way to how this can be done. The video container would need to become progressive encoding aware. Wavelet encoding in JPEG2000 would also have a similar progressive encoding scheme.

Thursday, June 4, 2009

Multiple definition of... extern inline

When I tried to compile gnutls-2.6.6, I got a lot of these:

.libs/gnutls_compress.o: In function `__strcspn_c1':
/usr/include/bits/string2.h:972: multiple definition of `__strcspn_c1'
.libs/gnutls_record.o:/usr/include/bits/string2.h:972: first defined here
.libs/gnutls_compress.o: In function `__strcspn_c2':
/usr/include/bits/string2.h:983: multiple definition of `__strcspn_c2'
.libs/gnutls_record.o:/usr/include/bits/string2.h:983: first defined here

It turns out that gnutls decides to enforce ISO C99 standard on all its source files, but a lot of the glibc header files use extern inline, which causes GCC to emit a symbol for each extern inline functions. GCC made a change to the extern inline semantics in GCC 4.3 in order to be ISO C99 compliant.
The fix I chose is to add -fgnu89-inline option to GCC throughout. It can be accomplished by invoking configure script like this:
./configure [config_flags...] CFLAGS=-fgnu89-inline
And this solves the problem.

Tuesday, June 2, 2009

Editing PDF Files Using Open Source Software

Here is a workflow that allows quite fancy altering of PDF files using only open source software. I've only used it to add overlays, so your mileage may vary.

Use pdf2svg to convert the PDF to SVG. Since PDF may contain multiple pages, each individual page must be converted to its own SVG file, but you only need to convert the pages you will edit. If you are not editing a page, you don't need to convert it but keep the original PDF file; we will use it later.
These SVG files from (1) can be edited natively in Inkscape. Use "Save as..." and save the result using PDF via Cairo (*.pdf). I did not convert text into path.
Follow the suggestions in How to concatenate PDFs without pain; I used the last pdfpages .tex approach that allows me to assemble from various PDF files. This requires pdfLaTeX, which comes with a TeX distribution, e.g. TeX Live. The .tex file must be processed through pdflatex and not other variants of latex command.

The SVG output from pdf2svg is a bit clunky. Not sure if that's because fonts are embedded. The PDF output saved via Cairo is also a bit clunky, but the save is quick. Finally, pdfLaTeX runs fairly quickly and produces an optimized file comparable in size to the original.

ARM processor notes

The first thing I knew about ARM processor is that, in Glibc, they don't support atomic machine instructions, relying on the Linux scheduler to grant atomic uniprocessor access in order to implement a somewhat atomic compare-and-swap operation. However, it does mention that, the new ARMv6 specification has ldrex/strex (load-exclusive and store-exclusive) instructions.
Some notes about ARM lineage from the ARMv6 whitepaper: ARM10 processor implements ARMv5 specification (1996-), which is uniprocessor only). ARMv6 (2000-) adds shared-memory multi-processor support that is implemented by ARM11 family of processors (2002-).
The ldrex and strex instructions are no different than load-link/store-conditional, a primitive form of hardware transactional memory. They can be used to implement compare and swap.
Although Glibc doesn't support ldrex/strex, the Linux kernel apparently does.
I am interested in compare and swap implementation in ARM because it is a fundamental building block for scalable shared memory multi-processor programming, used by software to implement non-blocking algorithms. If ARM wants to invade the server space by offering low energy usage data centers, they would have to do that.

Friday, May 29, 2009

Revival of IBM ROM BIOS

Hypervisor (or virtual machine manager) seems to be a hot topic these days. It is a thin layer of protection domain that arbitrates resources among different operating systems (OS). This facilitates running multiple OSes on one machine concurrently. Some of the claimed advantages are:

Better protection in case of system compromise. If an attacker gains access to one OS, other OSes are not affected. Furthermore, for a hypervisor that supports snapshotting, if a system is compromised, its image can be restored to previously known good state.
Consolidation of hardware and flexible control. An OS image can move from machine to machine. This can be used to keep more machines utilized; the other underutilized machines can powered off to save energy.

To operating system researchers, hypervisor provides much easier development environment. Virtual machines like Qemu provides a gdb stub so the kernel can be debugged in gdb. However, an OS can only attract developers if it is in a functional state, and a significant barrier to get an OS into a functional state seems to be writing device drivers. A typical OS involves dealing with:

Power management (ACPI)
Disk controllers (Parallel ATA, AHCI)
Video display (vendor specific, but most have VBE BIOS for basic non-accelerated 2D graphics)
Networking (vendor specific)
Sound card (also vendor specific), and
USB or Firewire peripherals (OHCI and UHCI).

It may have to support legacy PS/2 keyboard and mouse devices as well as serial port. A virtual machine hypervisor, on the other hand, would implement an anti-device driver, which receives I/O requests from the guest OS and act upon them like real devices would do.
A hypervisor is sometimes designed to run on top of a host OS, and its anti-device drivers must translate raw I/O request back to the high-level I/O interface that a host OS provides. This translation is necessary so it can run unmodified OS that supplies a wealth of device drivers. But this translation is a waste of effort for operating system researchers who want to get an OS up and running quickly. They do not have resources to hire developers to write device drivers. Their own time can be better spent than reading device specifications.
When Sony introduced PlayStation 3, one of its more interesting capability is to be able to run Linux on it. Linux doesn't have direct access to most of the devices on PlayStation 3 (most notably the video card RSX hardware). However, hypervisor on PlayStation 3 provides a hypervisor call API that allows restricted frame buffer, hard drive, and USB access. Most of the Linux device drivers on PlayStation 3 are straightforward hypervisor calls. These hypervisor calls are undocumented, but their functionality can be deduced by how a call is used in the open source Linux kernel.
Back in the MS-DOS days, IBM ROM BIOS provides rudimentary drivers to interface with devices via BIOS call.

A jiffy counter in the BIOS data area around 0x400, updated by IRQ1/Int 8h handler.
A keyboard driver handles key-presses on IRQ1/Int 9h by putting keystrokes in a simple ring buffer, which is read using the keyboard service software interrupt Int 16h.
Floppy and fixed disk (hard disk) controller handling IRQ5 and IRQ6. It also provides a synchronous, blocking disk I/O service via Int 13h.

There is also video service via Int 10h, but it is overridden by most graphics cards to support VBE. VBE is still used nowadays in the Intel X.Org driver to query video modes.
MS-DOS provide basic file system, character I/O devices (serial port, printer, and console), memory and process management. Most applications still use BIOS calls if they want to. Computer viruses were particularly interested in intercepting BIOS calls to find opportunities to infect disks. However, games had to write their own video and sound drivers because these hardware types were not as standardized.
Sony PlayStation 3 hypervisor call looks very much like BIOS calls.
The PlayStation 3 hypervisor does not run multiple OSes at the same time, so it does not arbitrate resource allocation. However, its purpose is to restrict access to their hardware, so the hardware interface can remain proprietary.
A general purpose hypervisor, however, can also provide hypervisor calls. This allows better integration of guest OS with the host. For example, users in the host OS could resize the virtual machine window, and the guest OS is notified of resizing. The hypervisor could also provide OpenGL support for 3D acceleration. Many commercial virtual machine products accomplish this by supplying a proprietary guest-addition driver that bypasses hardware emulation and uses undocumented hypervisor API.
Now, it would be nice to have a standardized hypervisor API, so operating system researchers can put together a fully functional OS quickly to gain support, focusing only on the architecture and less on device drivers. Such experimental OS can also be run on hardware with a non-multitasking hypervisor (like Sony PlayStation 3), providing abstract hardware access. In essence, it would be extremely convenient for operating system research and development if we have something like ROM BIOS back in the days.

Friday, May 22, 2009

Interview Question

Yet another interview question that I came up with. This one tests the interviewee's ability to read and understand basic specifications.
You have a C program written to run on Linux that uses strndup(3), size-bounded string duplication. The function returns a copy of a source string only up to n characters. The copy is allocated using malloc(), and the copy is always zero-terminated even when the source string may not be. When you try to compile the program on Mac OS X, you found out that their Standard C library doesn't have strndup().
After browsing around, you found the functions strlcpy(3) and strlcat(3) which are the size-bounded string copying and concatenation. Can you use them along with malloc() to implement strndup()? If so, show your work. If not, explain why not.

Monday, May 18, 2009

autoheader, automake, autoconf

Some notes about these wonderful tools.

When incorporating and distributing third-party source code, autoconf can bootstrap recursively the third-party configure script. Specify using AC_CONFIG_SUBDIRS([third_party/src/...]). Additional configure arguments for the recursive invocation can be added to $ac_configure_args right before AC_OUTPUT.
Modernized configure.ac uses AC_INIT([name], [version], [desc]). To add automake capability, just use AM_INIT_AUTOMAKE without arguments.
Autoheader is used to generate config.h.in from configure.ac, which derives the configure script that converts config.h.in to config.h. The config.h file contains C preprocessor macros that affect compilation behavior depending on the outcome of configure script. Autoheader either requires AC_DEFINE to provide a description (third argument) or AH_TEMPLATE([VAR_NAME], [desc]) to exist in configure.ac.
To check for packages using pkg-config, use PKG_CHECK_MODULES(VARIABLE-PREFIX, MODULES, [ACTION-IF-FOUND], [ACTION-IF-NOT-FOUND]) macro (pkg.m4 seems to be the only source of documentation). If [ACTION-IF-NOT-FOUND] is blank, then the default is to fail and quit configure script. Supply [true] to make this package optional (i.e. if package not present, continue gracefully).

Wednesday, April 8, 2009

Vehicle warranty is about to expire

I hate getting these phone calls. They always tell me my vehicle warranty is about to expire. I don't even have a car.

Today they called my cell again. I figured it was that kind of phone call because the caller ID number looks unfamiliar. I picked up the phone and whistled into the phone the "intercept" special information tone (which indicates the number has changed or disconnected). You hear this tone when you dial an incorrect number. I'm sure I didn't get the pitch perfectly, but after I whistled the tone four times, the unsolicited caller promptly hung up. I guess that was successful.

I'll have to see if the phone company complains to me for phreaking the phone's signal system, and if I really do stop getting the vehicle warranty call.

Wednesday, March 4, 2009

stdio.h discard a line from input

There are a few different ways to phrase this question, all relating to stdio.h input handling:

What is a better way to discard input other than using fgetc() to read it character by character?
How do I discard a line without using fgets() to store it first? It may not clear the line completely due to limited buffer size.

Through creative uses of scanf() this can be done. The format string of scanf() looks like printf(). Rather than passing the value to be printed, you pass a pointer to some memory location to store the value that is parsed. In addition, you can use an astrisk '*' modifier to tell scanf() to discard input.

The first try, scanf("%*s\n"), doesn't really quite do what we wanted. The format "%s" matches only non-white characters (space, tab, newline, etc), so we only discard up to the nearest white space. The "\n" eats another white space (not a newline, contrary to intuition).

The Glibc version^* of scanf can match a sequence of characters of any length given character ranges in brackets, sort of like POSIX regular expression character classes. The matching does not care about white spaces, so scanf("%*[^\n]") discards all non-newline characters from the input. But that doesn't quite do the trick yet. The input buffer still has a newline character.

Finally, scanf("%*[^\n]\n") does the trick. It first discards all characters leading to a newline, then discards a white character which must be a new line.

^*Update: appears to be ANSI C3.159-1989, so this feature should be portable to any conforming implementations. AIX has it. BSD has it. Glibc is the only one I tried.

Monday, March 2, 2009

C++ undefined vtable symbol

I was recently puzzled by a student's problem using GCC to compile some C++ code. He compiled the partial code after every changes in order to make sure he didn't screw up. At one point, the code compiled but gave this linker error:

Undefined symbols:
"vtable for PRNG_Uniform", referenced from:
__ZTV12PRNG_Uniform$non_lazy_ptr in ccBZAsss.o
ld: symbol(s) not found
collect2: ld returned 1 exit status

Searching on the web didn't yield any meaningful information about the error message. It took me a while, but I finally realized that it just means "there are some virtual methods you haven't defined; please define them."

In this particular example, the idea is to have an abstract base class PRNG (pseudo-random number generator) that derives PRNG_Uniform (uniform distributed) and other PRNGs with a variety of probability distributions (e.g. exponential, normal, or even deterministic). There is a virtual method from the base class that takes a random sample. The subclasses would implement that method. Another part of the system he implements would refer to the base class PRNG without caring what probability distribution characterizes the random samples.

The idea is that C++ compiler creates a virtual table for PRNG and each of its subclasses. The virtual table allows the method call on an object to be re-routed to the appropriate subclass implementation, even if all we have is a pointer to a PRNG object rather than to PRNG_Uniform or a specific subclass of PRNG.

In order for this virtual table to be available for the linker, all the virtual methods must be defined at link time.

Sunday, January 11, 2009

Objective-C memory management

Here is a quick summary of Objective-C memory management.

Objects are reference counted, so if you do:

NSClass *objp = [[NSClass alloc] init];

It will need an accompanying:

[objp release];

But many Core Foundation classes also have factory (class) methods that look like:

NSSomething *objp = [NSSomething somethingWithInitializer:initializer];

For example, NSArray methods +array, +arrayWithArray, +subarrayWithArray, ...; NSString methods +string, +stringWithString, ...; and so on. The convention used by Core Foundation is that the object returned by factory methods still have reference count of 1, but they are added to the "autorelease" pool. You do not release nor autorelease these objects!

The "autorelease" pool is simply a list of objects to be released at a later time. The pool is dynamically scoped (like exception frames) and is unrelated to the static (lexical) scope of the program. When the program leaves an autorelease scope, all the objects in the pool of that scope are released. An object can be added to the pool several times, in which case it will be released the same number of times they are added to the pool. There doesn't seem to be any way to take an object out of autorelease pool.

Cocoa's Event Manager always creates an autorelease pool frame prior to handling an event. The frame is popped after calling the event handler. This means your program typically creates some garbage that is cleaned up altogether after finishing an event. Objects that outlive the scope of an event must be explicitly retained.

Apparently, Leopard also supports garbage collection with the -fobjc-gc GCC flag.

Thursday, January 8, 2009

Making NULL-pointer reference legal

Normally, NULL pointer reference results in access violation and is illegal. However, it is possible to make NULL pointer reference legal on several operating systems. There are two ways to do that: (a) using mmap(2) with MAP_FIXED flag (works on Linux and Mac OS X), and (b) using shmat(2) with the SHM_RND flag (works on Linux). Consider the following program:

#include "stdio.h"
#include "sys/mman.h"

int main()
{
void* p = mmap(
NULL, 4096, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_PRIVATE | MAP_ANON, -1, 0);

if (p == NULL)
printf("*(NULL) = %x.\n", *((int*) NULL));
else
perror("mmap(2)");

return 0;
}

The program, when run on Linux and Mac OS X, prints *(NULL) = 0, and that's the result of the program doing a NULL pointer reference. This doesn't work on AIX.

Uses of hugetlb

While the kernel hackers busily working out what are the reasonable performance expectation of hugetlb in the Linux kernel for allowing application to use large memory pages (>>4KB, typically ~4MB) offered by Page Size Extension, there are a few consequences for using large pages that restrict applicable uses of it. Using a large page improves performance by significantly reduces TLB misses, but the sheer size and alignment requirement of the page is the cause for concern. Assuming a large page is 4MB in size, a 4MB page has to be aligned to 4MB in the physical memory and has to be continuous, and the OS mixing 4KB and 4MB pages might run out of continuous physical memory due to fragmentation caused by 4KB pages.

An application could use large pages as a general-purpose heap, but it should avoid fork(). There are currently two ways to allocate large pages: using mmap(2) to map a file opened in hugetlbfs (on Linux only), or shmget(2) passing a special flag (SHM_HUGETLB on Linux, SHM_LGPAGE on AIX, noting that on AIX a large page is 16MB).

Using shmget(2), the shared memory backed by large page cannot be made copy-on-write, so both parent and child processes after fork() now share the same heap. This will cause unexpected race condition. Furthermore, both processes could create a new heap space which will be private, and references to the new heap will cause memory error in the other process. References to memory newly allocated in another private heap such as malloc(2) will also be invalid in the other process.
While mmap(2) allows you to set MAP_PRIVATE to trigger copy-on-write, the copying is going to be expensive for two reasons. The most obvious one is the cost of copying 4MB data even if the child only modifies a word of it and proceeds with an exec(2). With 4KB memory pages, copying a page on demand is much less expensive and can be easily amortized across memory access over process run-time. The other reason is that Linux might run out of hugetlb pool space and has to assemble physically continuous blocks of memory in order to back the new large page mapping. This is due to the way Page Size Extension works. Furthermore, the OS might also require the swap space backing a large page to be continuous, needing to move blocks on a disk, causing lots of thrashing activity. This is much more expensive than simple copy-on-write.

A large page could also be used as a read-only memory store for caching large amounts of static data (e.g. for web servers), so it doesn't have to be made copy-on-write. It could be made read-write but shared if memory access is synchronized across processes, for example by using SVr4 semget(2) or by using lock-free wait-free data structures. This is appropriate for database servers.