Saturday, November 30, 2013

To err is human; to out, pipeline.

Many introductory programming classes begin with the hello world example like this:
#include <stdio.h>

int main() {
  printf("hello world!\n");
  return 0;
From this point on, the notion of writing human readable messages from a program to “standard output” is cemented in the student's mind, and a student continues to do so for the rest of his professional career. This is bad because this is not the intended design of computer systems. The intent is that “standard output” (abbreviated as stdout in a program) be machine readable, while “standard error” (abbrev. stderr) is human readable. A corrected version of hello world would be:
#include <stdio.h>

int main() {
  fprintf(stderr, "hello world!\n");
  return 0;
Pipeline is a feature of Unix that allows you to chain the standard output of one program to the standard input of another. Simple programs can be arbitrarily composed to perform complex functions. It is also a feature of many operating systems inspired by Unix, such as MS-DOS, and its successor, Windows. Standard error is used to report errors, so you typically don't pipeline standard error.

As an example, if a text file contains your personal address book, one contact per line with space separated fields like this:
$ cat contacts 
John Smith 555-1234
John Doe 555-2345
Jane Smith 555-1234
Jane Doe 555-2345
Alan Smithee 555-3456
Smith Johnson 555-4567
Then you could list all last names like this:
$ cut -d' ' -f 2 contacts 
Enumerate the Smiths (last name):
$ cut -d' ' -f 2 contacts | grep '^Smith$'
And count the Smiths:
$ cut -d' ' -f 2 contacts | grep '^Smith$' | wc
       2       2      12
You can sort and list the unique last names:
$ cut -d' ' -f 2 contacts | sort -u
You can do similar things to email and phone area code, all using simple programs that don't do very much on their own, but can be combined using pipeline to perform specific queries or text manipulations:
  • cut: extracts the fields of input lines.
  • grep: prints input lines matching a pattern.
  • wc: counts lines, words, and bytes.
  • sort: sorts and optionally prints only unique lines.
This doesn't work if one of the programs in the pipeline would output error message to standard output.

For a beginner who might confuse the roles of standard output and standard error, the following mnemonic should help.
To err is human; to out, pipeline.
Maybe humans do err too much.

Friday, November 29, 2013

Aligned data transfer, or Why Headers Must Come Last

When a program on computer A wants to send data to a program on computer B, the data undergoes a series of nested transformation where, at each stage, a header is prepended to the data. Prepending is costly because data must be copied to a new location in order to make space when inserting, or to eliminate space when removing headers, assuming the alignment of data in the destination is important. Alignment is important because of the way memory system is divided into pages, and the way permanent storage is divided into sectors.

The current protocol stack design relies on prepending so much that any implementation inevitably performs a lot of data copying. Copying data is detrimental to performance because of wasted CPU cycles and memory bandwidth. If headers come after the payload, we can append and remove off headers easily with simple memory management, all without affecting payload alignment.

Header prepending is generally not a problem for the sender because network cards support gathering (i.e. vectored I/O) which incurs no cost for packet assembly, but the receiver cannot in general predict the header size. We can program the network card to scatter the payload to a well-aligned memory address only if we know the header size in advance.

Consider sending data as a typical TCP/IP packet. The data undergoes the following transforms:

  • Optional application-level encryption and integrity checking. This adds frame header to tell apart negotiation from data transmission, and also a cipher header. See TLS/SSL.
  • TCP header is prepended by the operating system to encapsulate the data within a logical stream socket, for establishment of connection, in-order delivery, and flow control.
  • IP header is prepended by the operating system for routing information, i.e. source and destination of the packet. Both IPv4 and IPv6 are in active use.
  • Other host-level encapsulation, e.g. VPN, PPPoE.
  • Depending on the link layer, either the network card or the OS would prepend a link layer header. This allows the communication hardware to identify itself over the medium.
  • Other encapsulation and decapsulation headers (e.g. MPLS, DOCSIS) are not visible to the operating system.
Predicting header size is hard primarily because of the dual-stack and host-level encapsulation. One can argue that because host-level encapsulation typically transmits data over slow and high-latency WAN links, performance is less of a concern. But think about the Internet backbone routers that might also need to encapsulate. Any saving in their CPU cycle and memory bandwidth is an increase in overall Internet performance.

If we append instead of prepend headers (of course header would be a misnomer now), headers can be removed by truncation. We still need gathering so that the operating system does not need to rely on the application to allocate sufficient buffer size to accommodate for all headers. Now the scattering only needs to focus on receiving a jumbo packet across multiple discontinuous memory pages.

Network technology can continue to scale, but with the current header prepending protocol design, this requires the CPU and memory technology to scale 2x or more than the network. This is unrealistic, so we must redesign the network protocols to append headers instead.

Wednesday, November 27, 2013

REPL kernel and CPU scheduler

Today I saw a picture of BeOS and its kernel debugger. It had occurred to me that a debugger is really the read-eval-print-loop (REPL) of the crude assembly or C language. It makes perfect sense why it's a good idea. BASIC, Scheme, Standard ML, Python, and JavaScript are some example languages featuring a REPL. GDB even features a limited REPL that allows you to call an existing function when debugging a C/C++ program. The REPL makes it easier to inspect the program state and try out some modifications of the program while it's running. It's a powerful rapid prototyping tool. Any serious programming environment must support REPL.

Kernel debugger must be in some way tied to the CPU scheduler, since it needs to suspend execution of the kernel and inspect its state in suspension. This leads me to consider in a very abstract sense how to write a CPU scheduler in general.
void scheduler() {
  while (1) {
    task_t *task = next();  // OS sched. alg.
    if (!task) {
      halt();  // CPU specific.
    run(task);  // CPU specific.
    put(task);  // OS sched. alg.
The scheduler is really an event-driven loop using some CPU specific subroutines and OS scheduler algorithm functions.

The CPU specific functions are:
  • halt() makes the CPU idle, possibly enters power-saving mode, until an interrupt occurs.
  • run() context switches into the task and returns when an interrupt occurs.
Interrupt is the primary way to allow the scheduler to regain control. For example, preemptive scheduling can be implemented by setting a timer interrupt to force run() to return when the interrupt is triggered. It could be a periodic or a one-shot interrupt; we don't really care. For hardware that only has one single-shot timer, we can easily write a timer multiplexer.

To make things simple, assume that everything in the kernel and the user processes can be structured as a task. The run() function would have to discern what type of context switch is appropriate for what type of task.

The OS scheduling functions are:
  • next() to fetch the next task in the scheduling queue that is ready to run.
  • put() places the task back into the scheduling queue.
The scheduler itself has no notion of priorities. It is next() that returns the next task to run. But depending on the implementation of the queue, the actual scheduling decision can be made in either next() or put(). Furthermore, we defer the notion of task completion to put(). If put() determines that a task is done, it will dispose the task in some other way than putting it back to the queue.

In a multi-processor system, each CPU will run its own scheduler and have its own task queue. The next() function might attempt to steal task from another CPU's queue if the current queue is empty. The queue takes ownership of a task, but the task can migrate from one queue to another. Work stealing of distributed queues is a good load balancing strategy, and I think it's the only such strategy that is provably scalable.

Within this scheduler, one possibility is to represent the REPL as a task and schedule it like all other kernel tasks. This is the simplest, but the REPL would run simultaneously with all tasks. It cannot be used to debug the scheduler or examine a freeze state of the CPU. One the other hand, having a REPL that can look at the live state of the kernel is pretty cool. Suspending the kernel is quite easy: simply allow uninterruptible kernel tasks in run(), then REPL can switch between live and freeze states by toggling the interruptible flag of its own task. You can also suspend other CPUs using a boolean variable to force their next() to all return the nil task which makes the CPU halt. You don't need to disable their interrupts. To resume, just reset the variable.

Even so, such REPL won't allow you to step into or over the scheduler or interrupt handlers. A true SoftICE styled debugger is only possible if you hijack the CPU specific implementation of run(). The good news is that it should be able to coexist with our REPL which is already pretty useful.

Friday, October 4, 2013

Micro Kernel, a Historical Perspective of Distributed Systems

In the 1980's, microkernels were developed as a response to the complexities and flakiness of a monolithic kernel. By introducing modularity and abstraction at the run-time level, the fundamental functionalities of a kernel is broken down into manageable services that are easier to maintain. These distributed services communicate by remote procedure call. Remote procedure call differs from local procedure call in that data dependency of the function call argument has to be made explicit because the two services are in different address spaces, not necessarily sharing the same memory. One way to satisfy data dependency is by copying, but the overhead of copying might be unacceptable if a lot of data is transferred this way. Another approach is the out of band buffer transfer. I have written about these issues of interprocess communication before.

These days, programming languages have the ability to enforce modularity and abstraction at compile time, so each module could be run safely in the same address space even without hardware protection. As long as someone is willing to port a monolithic kernel to a new language (i.e. rewriting it), the initial motivations of the microkernel—modularity and abstraction—are largely addressed. But this is not to say that microkernel has become irrelevant.

Around 1970's, there have been effort to network computers that can talk to one another. In the 80's, it was recognized that a cluster of interconnected computers should work together in parallel in order to speed up computation and scale up the amount of data that could be processed. They coordinate computation, again, by interprocess communication. Scaling by parallelism continued as computers are also made faster, denser, and smaller. Fast forward until today, the speed of a single computer has approached its quantum limit, so processors have gone multi-core, which is to make a processor a package of multiple computing units, each capable of acting as an autonomous computer.

The difference of modern multi-core with the 80's design is the level of abstraction at which interprocess communication takes place. In a modern computer, even when the CPU and the main memory are all distributed nodes, the hardware hides much of this and still presents the illusion of a single computer. But scaling general purpose hardware is much more difficult than scaling software which tends to be domain specific. Domain specific hardware does exist when the economy sustains it, for example the torus network topology is designed for parallelizing convolution problems such as weather simulation, funded by national weather service, and the GPU is designed for graphics rendering, which is an embarrassingly parallel problem, in video games, movies, as well as industrial visualization.

On the one hand, we have single computers that have become a distributed system of multi-core nodes using a vendor-specific interconnect (QPI for Intel and HyperTransport for AMD). Barrelfish is an operating system that has recognized and embraced this in its design. On the other hand, we have cheap system-on-a-chip (SoC) computers like Raspberry Pi that can be built into a cluster of computers over the Ethernet, using some TCP/IP based RPC for distributed computing as back in the 80's. As one of the operating system's purpose is to abstract the underlying hardware, if we were to reconcile the differences of these disparagingly different yet similar architectures, we arrive at a design of distributed processes coordinated through interprocess communication, similar to that of a microkernel.

Even in the 80's through 90's, visionary researchers have foreseen this need, and as a result we have distributed operating systems like Plan 9 (which begets the programming language Go) and Amoeba (which begets Python). Although the distributed operating systems have never been very popular because the hardware running them weren't, now that the hardware is becoming more popular, so will the distributed operating systems.

Saturday, September 28, 2013

Unified Binary JSON and Cap'n Proto Wire Format

Earlier I wrote about binary JSON wire format which is inspired by MsgPack but allows string chunking and compact representation of array and map. Then I wrote about Cap'n Proto which transmits fields of a struct as it is laid out in memory so it needs to encoding and decoding, but I lamented that the wire format gets complicated because you actually have to do some memory management in order to cope with strings and repeated fields. Then it occurred to me, why not take the best of both worlds?

The idea is that, after transmitting the memory layout of a struct, the data would be followed by a series of editing instructions for populating non-integral fields such as arrays and strings and other objects. Since the memory layout may need to be aligned, we need to allow padding so that we can access structs through e.g. memory mapped I/O.

For the sake of simplicity, let's consider a struct that contains only int64 fields and object pointers. Smaller integers are packed as bit fields, e.g. two int32's per int64. Object pointers are only used for strings, arrays, or other composite objects. On a 32-bit machine, only the immediate half of the int64 field is used for the object pointer.

In the previous proposal, tags 0xAE and 0xAF are reserved, so we can use them for encoding struct with in-memory layout. The logical wire stream would appear like this:

  • Tag 0xAE: struct with big-endian in-memory layout.
    • Followed by an integer object n for the number of int64 fields in the struct.
    • Followed by a short string object of lengths 0 through 7 (actual object length is 1 through 8). This string is only used for padding and is ignored.
    • Followed by n * 8 bytes of struct field data in big endian.
    • Followed by the edit map object whose keys are field numbers and values are objects to be stored at the field. Repeated keys may be used to populate a repeated field in the same order the values occur.
  • Tag 0xAF: like 0xAE, except struct field data are little-endian.
In the case that the reader is given an a-priori schema, it can perform the field-editing on the fly according to the schema and reject bad values in the edit map. In the case there is no schema, the reader might save the edit map and postpone the editing for later until the struct is bound with a schema. The reader may attempt schema-less on the fly editing by keeping track of value type conflicts for a given field number. Without a schema, the types of the missing fields would be assumed to be int64 unless overridden.

As far as the writer is concerned, a struct object can be written either as big-endian or little-endian, and it can optionally be written out simply as a map (as opposed to 0xAE or 0xAF tagged objects) to maximize schema-less compatibility and/or efficiency in encoding.

Saturday, August 24, 2013

Revamping signal system for the subway

The other day I was watching an old episode of Extreme Engineering, Subways of America. One of the problems with scheduling is collision avoidance. In order to reduce the likelihood of collision, you space out the trains or keep them moving slowly. But that means the utilization of the track becomes low or you increase commuter time. Collision is a huge problem for subway because visibility in the underground tunnel is low, and once an accident happens, there are only very limited escape options. A smoking fire can quickly turn into a disaster. The signal system allows collision avoidance in a more predictable way, so that efficiency can be improved.

Although the episode focused more on NYC subway, I take the MBTA every day to work, and I often wished its efficiency could be improved. The signal systems of the mass transit lines, red and blue, have been recently revamped (I don't take the orange line so I don't know). The green line, on the other hand, is still using an old system. They basically stick a signal light with a sensor and a timer to the right of the track. When a train passes, the sensor detects the train, and turns the signal light red. After 10 seconds, the signal light turns yellow, and another minute the signal light turns green. (I have not actually timed the intervals. I just pulled them out from memory). Signal lights are spaced every some distance. When you hear MBTA announce signaling problem, one of the light bulbs probably just burned out or the light got stuck in green.

What a signal system boils down to is a way to estimate distance between two trains, so the conductor can adjust the speed of the train accordingly. But installing signal systems can be expensive particularly if it requires new wiring along the tunnel. It requires taking parts of the tunnel offline, and servicing is expensive. The ingenuity of the timed signal lights is that it's a decentralized, cost-effective solution that worked well 100 years ago when the pace of life was slower.

I propose a similarly decentralized signal system where each train gains the visibility of its distance from other trains without the need to install new wiring to the tunnel. The key idea lies in applying Broadband over Power Lines technology which converts the "third rail" or the power line into an ether where communication can take place. Similar to how GPS and NTP systems work, trains will broadcast time-stamped messages at a given interval, and by measuring the round-trip time of these messages, trains can estimate the distances between each other. Stationary beacons can be placed at fixed distances on the track so trains can also track their absolute position.

The stationary beacon also detect if a train is passing without the active broadcasting working. If so, it sends a warning to all trains. The train immediately behind the warning beacon will ask the beacon how long ago the no-signal train has passed and adjusts its speed accordingly, essentially reverting back to the old system. The system is successful if it would allow both old and new trains to work on the same track. A graceful upgrade option is also a graceful downgrade option.

Now done with the logistics, here is the technical part. Light travels about 30cm each nanosecond. This is called a "light-foot." Electric signal travels at similar speed. We can convert the round-trip time to a distance, but we're constrained by the accuracy and precision we can measure time. Computer processors run at 1GHz or more, so for the sake of discussion we can equate 1 nanosecond to 1 CPU cycle. It is generally possible to use off-the-shelf computers for this purpose, except off-the-shelf operating systems are not to the task. We need a dedicated real-time operating system that can achieve latency in just a handful of CPU cycles. Since main memory access incurs cycle penalty of 20 cycles or more, the system will run entirely off the CPU's L2 cache.

Using a general programmable CPU has the benefit that we can easily design and test the time-distance measurement protocol to make it safer, more efficient, and more robust. On the other hand, if this is programmed in hardware, then it would take many years. GPS has its root from 1940's, with initial deployment in its current form in the 1990's. It took GPS more than 20 years to reach current maturity. Although we're not launching satellites, the positioning system proposed here requires faster response time and more precise distance measurement, and will likely not be doable in another 10 years in hardware. I think we can already develop the technology in software in a matter of a few years.

Friday, August 23, 2013

Towards a decentralized social network

These days, social network platforms are a great way to share with people aspects of your life, but once you upload your life to the provider, you have very little control over it. Gaining complete control of your data means you have to host it, but hosting a site can compulsorily expose other private information (e.g. WHOIS) that you don't want to share. One solution is to host an individual's social network profile over TOR hidden services. Here are some notes about how social network might work in a completely distributed fashion using OpenID and OAuth.

A personal profile is an URL. The URL is an HTML page with link rel serving an OpenID. The HTML page can also be a Other link rels can also expose RSS feed that your friends subscribe to. Which posts your friends get over RSS depends on their OpenID when they log into your profile site. The streams page where you read your friends post is simply a custom RSS reader. For this to work, you also need to log into all your friend's profile site in order to read their posts. You can use cookies to keep the authenticated state, but this is not ideal.

OAuth is an alternative which works differently. When you log into your streams page and wants to see the posts from all of your friends, their site would request your site to provide them with an OAuth access token. The OAuth would allow them to access just your public information enough for them to decide that you are trustworthy through OpenID Connect. They would then publish an RSS appropriate for your identity. There are two examples: Google+ Sign-In API and Facebook OAuth dialog.

Unfortunately there is no standard way to interact with other social networking services after an OAuth token is granted, e.g. posting messages publicly or privately to the other user.

Wednesday, August 21, 2013

Essence of a secure socket protocol

This evening I went back and read RFC 5246 Transport Layer Security Protocol 1.2 just for the heck of it. It is pretty complex. Pages 7 through 14 define how to serialize messages. Pages 15 through 63 define the structure of different message types passed between server and client and their meaning. The remaining few pages define how client and server may use a shared secret to exchanging data.

A secure socket protocol really boils down to this. Alice and Bob wish to talk to each other, so they send each other their public keys. The public key is used to encrypt a session key for the recipient which is subsequently used to encrypt the actual message from sender. After receiving each other's public keys, they send the encrypted session key, and encrypted data would follow.
Sends public key to Bob.Sends public key to Alice.
On receipt of Bob's public key, tells Bob about how subsequent messages will be encrypted (session key). On receipt of Alice's public key, tells Alice about how subsequent messages will be encrypted (session key).
Starts sending ciphertext to Bob.Starts sending ciphertext to Alice.
Since TLS is layered on top of TCP, phase one begins after the TCP 3-way handshake, and TLS handshake is at least 6 ways. In total this takes 9 one-way trips or 4.5 round trips.

Phase 1 can be used to advertise the capabilities of the owner of the public key, so that the peer can choose an appropriate encryption and/or compression method. TLS also defines a mechanism to resume the use of a previous session key. This can be folded into Phase 1 as an option. The additional information sent for Phase 1 and 2 must be signed by the sender's public key. Authenticity and integrity of phase 3 communication is done using HMAC with the session key.

Here are some more ideas if I were to design my secure socket protocol from scratch.

A lower level secure socket protocol can embed Phase one with the SYN and Phase 2 with the ACK. If Alice is the client and Bob is the server, Alice would send her public key in SYN, Bob would respond with his public key as well as session key for alice in SYN-ACK, and then Alice will send session key for Bob in ACK. Then both can start communicating in ciphertext.

Notice that Phase 3 does not have to wait for Phase 2. Sender can immediately start sending ciphertext as soon as the session key is sent. This allows us to achieve just 1 round-trip. However, potential for network packet loss means that the sender has to be ready to resend the session key. It is probably easier implementation wise for Phase 3 to wait.

Rather than sending the public key all the times, a variation is to send a key-digest instead. If the recipient does not know the sender's key, then an additional round trip is required to obtain the key from sender. For a 4096-bit public key, this can save 512 bytes for the SYN. This is not a lot given today's high speed network.

Either party may terminate the connection early if:
  • It does not trust the peer's public key.
  • It does not understand how subsequent messages could be decrypted.
To ensure ciphertext authenticity and integrity, append HMAC to plaintext to be encrypted together. This makes it harder for a third-party to verify if they decrypted the message successfully. The receiver will decrypt first and then verify the HMAC.

The new secure socket would require extending the BSD socket API as follows:

  • connect_ex() is like connect() but allows passing arbitrary data, limited by single packet size, as part of the SYN. Upon return, the caller also gets the SYN data from the peer.
  • accept_ex() is like accept() but allows retrieving SYN data from peer as well as send its own, again limited by single packet size. This function might also take a callback which decides whether to accept or reject the peer's SYN data before sending our own.
The application layer implements its own encryption and compression so it can decide when symmetric key exchange ends and ciphertext communication starts.

Saturday, August 3, 2013

Eliminate Cache Coherence from HyperTransport

Processors are equipped with a cache to speed up memory access because main memory access speed failed to keep up with processor speed. On a symmetric multi-processor system, each processor possesses a private cache but all share a memory bus. Cache coherence protocol makes sure that all the private caches maintain a consistent copy of a portion of the main memory even though any processor may write to the main memory any time.

Things get complicated with NUMA where processors now come with their own memory controller and their local main memory. They still have their private cache. Processors no longer share a bus, but they send messages to one another through a point-to-point communication channel. A message may be routed through multiple hops in a network of processors. This is how HyperTransport works. Furthermore, a socket can actually house two multi-core dies, and on a quad socket system it can create quite an interesting topology.

In order to create the illusion of shared memory, a processor will need to talk to another processor for accessing remote main memory. The longer the hops between processors, the higher the latency. But HyperTransport 3.1 operates at 3.2GHz, so we're still talking about just 1 or 2 CPU cycles. Memory access time clocks in at 20 cycles or more.

Current generation AMD processor caches both local as well as remote main memory. Cache coherence over HyperTransport is a lot more complicated. If a processor wishes to conduct a transaction and invalidate the caches of other processors, it must send a message to all processors and wait until all of them to respond. This creates a torrent of messages over the HyperTransport link.

A recent feature called HT assist uses parts of the L3 cache to keep a record of which processors' caches need to be invalidated. This reduces the number of messages that need to be sent, increasing effective memory bandwidth.

Having said all this, here is my epiphany: what if we prohibit a processor from caching remote memory? It will only cache local memory, but will answer memory access on behalf of a remote processor from the cache. This would work well because accessing a remote cache over HyperTransport is still faster than accessing local memory. And we can do without the overheads of cache coherence protocol.

Furthermore, as more programs become NUMA aware, their structure would adapt so that they access primarily memory local to the processor, but leverage remote memory access as a mean of communication. Their performance will not be hindered by not caching remote memory, and it frees up the precious bandwidth on HyperTransport to do useful communication.

Thursday, July 18, 2013

Ideas for improving remote free for StreamFlow allocator

StreamFlow is a memory allocator that uses a simple idea to deal with producer-consumer threads, where an object is allocated from the producer thread, sent to the consumer, and subsequently freed. The object is "remote freed" back to the original heap using an atomic list insertion involving just a compare and swap. The remote free list can be vacated by the producer again using an atomic exchange. There is no need for complex lock-free wait-free algorithm.

But the remote-free lists themselves can become a performance bottleneck, as each remote free causes exclusive cache line to be migrated to the CPU performing the free. A pathological case is when any of the \(N\) threads could free to any other threads, causing cache line ping-pong and memory bus contention.

An idea is to cache an object to be remote-freed in a local free list first. For object sizes smaller than a cache line, reusing this local free list for allocation could cause false sharing, so this local free list is only for later remote free. Once this list reaches a certain length (say 1024), the objects in the list are sorted by the heap into batches, and then each batch is remote-freed in a single compare and swap.

Here is how the sorting would work. The sorter would allocate an array of 1024 pointers on stack, copy the object addresses to this array, and run in-place quicksort. Copying to the array might incur the most cache miss, but subsequent sorting would benefit by \(O(lgn)\) of cache hits. This idea came from What's the fastest algorithm for sorting a linked list? on StackExchange.

For object sizes greater than a cache line, false sharing occurs rarely, so the local free list could be used to satisfy allocation request first in the fashion of tcmalloc. It would be cleared using batched remote free if the list overflows.

Saturday, June 1, 2013

Some notes about setting up LDAP

One thing I want my pogo to do is to serve filesystem over NFS, but NFS is not terribly secure without Kerberos. I also want the uid on NFS to match across all my computers, so that means I need a centralized login like what LDAP offers. Since I might want to add Windows boxes some day when hell freezes over, I might as well throw Samba into the mix. So here I am, LDAP + Kerberos + NFS + Samba. Samba will use Kerberos to authenticate, and Kerberos will store credentials in LDAP. At the moment I decided not to use Samba4 which has its own integrated SMB + LDAP + Kerberos stack.

I installed and configured OpenLDAP on Debian. These are the things I learned.

If someone mentions modifying slapd.conf on the web, the guide is out of date. OpenLDAP now uses LDAP to store configuration files (called "olc" which stands for "on line config").

LDAP now serves at least two databases:
  • cn=config for olc
  • dc=example,dc=com for your domain database.
OpenLDAP has two sets of tools.
  • ldapsearch, ldapadd, ldapmodify: these are authenticated through the usual LDAP protocols, either through tcp (ldap://host), TLS (ldaps://host), or Unix domain socket (ldapi:///)
    • When using -Y external -H ldapi:///, the local uid is passed to the server as peer credential. For example, if you run them as root, the credential becomes "gidNumber=0+uidNumber=0,cn=peercred,cn=external,cn=auth"
    • Debian configures a cn=admin,dc=example,dc=com out of the box, so you specify -D cn=admin,dc=example,dc=com if you use -H ldap://host or -H ldaps://host.
  • slapcat, slapadd: these are brute force tools to access the slap database on the local filesystem.
The privilege to modify LDAP and olc is defined in olc itself. On Debian out of the box:
  • dc=example,dc=com can be written by cn=admin,dc=example,dc=com.
  • cn=config cannot be viewed nor modified by cn=admin,dc=example,dc=com.
    • It mentions olcRootDN: cn=admin,cn=config but there is no such dn in the database.
    • It allows "gidNumber=0+uidNumber=0,cn=peercred,cn=external,cn=auth" to modify this database.
    • Must use -Y external -H ldapi:/// for ldap* tools, or use slap* tools.
    • In theory you could change olcAccess so that cn=admin,dc=example,dc=com can view and make changes. I've decided not to bother.
Importing schema is a pain. The specific steps for Ubuntu Kerberos/LDAP is what worked for me. Generalize this for other things like Samba.
  • First create an input.conf in the old slapd.conf format. All schema prerequisites must be included.
  • Use slapcat -f input.conf -F output/ -n0 to preview the resulting ldif.
    • Identify the dn of the schema you actually want to import (sans the prerequisites if they're already imported),
    • Add -s 'cn={n}foo,cn=schema,cn=config' (change n and foo as appropriate) to slapcat
    • Pipe the ldif to a file.
  • Edit the ldif file and replace {n}foo with just foo (e.g. in the dn: and cn: entries), and remove this epilog:
    structuralObjectClass: olcSchemaConfig
    entryUUID: 76b47d96-5f66-1032-93e8-7152dc9f0505
    creatorsName: cn=config
    createTimestamp: 20130602002359Z
    entryCSN: 20130602002359.309272Z#000000#000#000000
    modifiersName: cn=config
    modifyTimestamp: 20130602002359Z
  • Use ldapadd -Y external -H ldapi:/// to import.
    • If you get ldap_bind: Invalid credentials (49), that's because you're still using the cn=admin,cn=config user which is not configured.
    • If you get ldap_add: Insufficient access (50), that's because you're still using the cn=admin,dc=example,dc=com user which does not have access to cn=config.
A special note about Samba: Debian's Samba Server Setup said I should restrict access to sambaLMPassword and sambaNTPassword. It's a good practice if you store Samba passwords in LDAP. Otherwise people will be able to enumerate the hashes and crack them. I will be using Kerberos so don't need to bother. Just configure it in olcAccess so krbPrincipalKey is not revealed. See the Ubuntu Kerberos/LDAP guide.

I will be using ldapscripts instead of smbldap-tools to add/modify users.

Once Kerberos is setup, one thing I found convenient is to authenticate to LDAP using Kerberos. This is what I did:
~# apt-get install libsasl2-modules-gssapi-mit
~# kadmin root/admin
kadmin:  addprinc ldap/
kadmin:  ktadd -k /etc/ldap/krb5.keytab ldap/
kadmin:  exit
~# chown openldap.openldap /etc/ldap/krb5.keytab
~# echo "export KRB5_KTNAME='FILE:/etc/ldap/krb5.keytab'" >> /etc/defaults/slapd
~# service slapd restart
And now once I kinit, I can run ldap* commands without entering password! Don't use saslauthd. It's a deadend.

I also find Setting up a Linux server for OS X clients, Portable Home Directories w/ OpenLDAP, and Apple OS X Integration With OpenLDAP helpful.

Sunday, May 26, 2013

Cross-compiling openldap on Debian Wheezy

Notes on problems cross-compiling openldap on Debian Wheezy and how I overcome some of the problems. I was trying to build smbkrb5pwd into openldap.
  • Problem: setting up build prerequisites:
    • Use emdebian gcc-4.4-arm-linux-gnueabi toolchain.
    • Use xapt -a armel -m <builddep> and xdeb.
  • Problem: configure couldn't find gnutls or some other dependencies.
    • Solution: use xapt -a armel -m <package> to install them (-m is for multiarch, which only works on Wheezy). Then xdeb -aarmel openldap should build it from local source.
  • Problem: libtool link complained about "liblutil.a: could not read symbols: Archive has no index; run ranlib to add one"
    • Solution: change and replace AC_CHECK_PROGS with AC_CHECK_TOOLS.  Rerun autoreconf.
  • Problem: unresolved external symbol lutil_memcmp ...
    • Solution: the library order in several and build/ are messed up. Reorder so that $(LDAP_LIBLUTIL_A) comes after $(LDAP_LIBLDAP_R_LA) or $(LDAP_LIBLDAP_LA).
  • Problem: /usr/arm-linux-gnueabi/lib/ No such file or directory
    • Solution: the specified the wrong directory ... dpkg-cross has it installed under /usr/arm-linux-gnueabi/lib/heimdal instead. Either symlink the .so or modify the .la.
  • Problem: contrib/.../Makefile sets CC=gcc.
    • Solution: in debian/rules, override in the command line like this: $(MAKE) -C ... CC=$(DEB_HOST_MULTIARCH)-gcc
  • Problem: dpkg-cross on libkrb5-dev removes /usr/arm-linux-gnueabi/lib/mit-krb5/...
    • I give up. There's gotta be a gooder way to do this...
It turns out that building smbkrb5pwd into openldap is a bad idea. It requires krb5 as a dependency, but krb5 requires libldap-dev provided by openldap, which means it a recursive dependency. Rather than putting smbkrb5pwd into the contrib package, the right way seems to be building it standalone.

Saturday, May 25, 2013

Cap'n proto inspired encoding

I guess the fact that Kenton Varda, a principle author of Protocol Buffer and the guy who built the ultimate LAN party house, came up with Cap'n Proto specifically to address the problem of serialization and deserialization overhead, is worthy of an investigation. The idea is to have a wire format that can be dumped to some memory location and used directly without decoding. But the encoding of Cap'n Proto is too complex for me to understand, so I decided to come up with my own.

First I like the idea that each field within a struct is 64-bit word aligned. Large objects like nested structs, strings, and arrays are referenced by an offset. In order to support offsetting and chunking, the wire protocol divides data to segments. When you construct a struct, you allocate memory from the segment. This is where I think it got too complicated.

Rather, let me propose a simpler wire format. A struct consists of uniquely numbered fields. Each field is exactly 64-bit. Plain old integral data will be stored as is. They can be packed, e.g. four 16-bit integers or two 32-bit integers in a field. The 64-bit field size allows storing a pointer on a 64-bit machine. This pointer is a machine and implementation specific object pointer, and is meaningless when serialized. Here is where my proposal differs from Cap'n Proto.

Immediately after each struct are a series of "fix-ups" or "field population instructions" that are basically a sequence of (field number, data type, data length, data) tuples. The field number specifies the field of the struct to be populated. Data type specifies how to populate this field with data. Data length specifies the number of bytes, followed by the action data.

For example, to serialize a struct like this:
struct S {
  // x, y, u, v are packed into field 0.
  // packed subfields must be naturally aligned to
  // their respective size.
  0: int32 x;
     int16 y;
     uint8 u, v;

  1: list<string> xs;
Then we would first dump field 0 (which has x, y, u, v all packed together) and field 1 of S, even though field 1's value would be meaningless. However, for each string in xs, we'll output a fix-up that says "add this data as a string to field 1." The decoder of fix-up instructions can reuse the same read buffer to construct string pointer references, so the string content need not be copied.

When this wire format is decoded, the decoder allocates memory for S and dumps all the field values there. It would consult the definition of S and realizes that it has to initialize field 1 by allocating a list object to hold the strings. Then it would process through the fix-ups and add the strings as required.

This will work for nested structs as well. What I haven't quite figured out is dictionary. The encoding would not be very efficient if we encode the fix-up for the key, then the fix-up for the value. In fact, since all the keys are of the same type, and all the values are of the other type, it is more efficient to encode them as two parallel lists.

With this encoding scheme, I don't need to deal with that inter-segment pointer non-sense.

Earlier on I wrote about an improved binary JSON serialization format. I think there is still a use case for that. Basically if a programming language does not use raw struct layout in memory, then it won't be able to take advantage of the Cap'n Proto styled encoding, and having a compact wire format is probably better. Only C/C++ can really take advantage of Cap'n Proto styled encoding. Also, this raw encoding requires a schema in order to know how to interpret data. With binary JSON, the data is self-tagging and self-delimiting. You can manipulate data even if you don't have their schema.

Monday, May 20, 2013

Exponential back-off algorithm for IP packet fragmentation

On reading about Path MTU discovery, I found this article Never Ending Story of IP Fragmentation. It explains the problem with bad IP fragmentation when there are multiple encapsulations. Long story short, network links can only carry packets of a certain size called "maximum transmission unit" or MTU. Large packets are split into fragments to fit to the MTU. But MTU is reduced when tunneling IP traffic, e.g. GRE (used by VPN like PPTP and IPsec) and PPPoE. If we naively split a packet by next-hop MTU, then we might get this situation:

  • Original packet has 1516 bytes to traverse over an Ethernet link with MTU of 1500. It is split into two packets, fragment 1 of 1500 bytes and fragment 2 of 36 bytes (the excess 16 bytes plus additional 20 bytes IP header).
  • Fragment 1 traverses through a PPPoE tunnel with MTU of 1492 bytes. It is split into two more packets, fragment 1.1 of 1492 bytes and fragment 1.2 of 36 bytes (8 bytes excess + 28 bytes overhead). However, fragment 2 traverses unaffected but now becomes 44 bytes.
  • Fragment 1.1 traverses through a IPsec tunnel with MTU of 1400 bytes. It is split into two more fragments, fragment 1.1.1 of 1400 bytes and fragment 1.1.2 of 152 bytes (the excess 92 bytes plus 60 bytes encapsulation overhead). The other fragment 1.2 of 36 bytes traverses unaffected but becomes 96 bytes after encapsulation; fragment 2 also traverses unaffected but also becomes 96 bytes.
  • Fragment 1.1.1 traverses through another tunnel...
If there are N tunnel encapsulations, for each packet we might get N fragments. This is a blow-up of N factor in the total number of packets sent. This incurs cost both in packet switching and assembly.

The industry standard practice is "Path MTU Discovery" or PMTUD. It sends a series of "do not fragment" packets down the pipe until it gets rejected, and the hop that rejects it advertises the MTU. The originator then reduces MTU and tries again. The sole purpose is that the originator of the packet can figure out how much data to send in order to fit the smallest MTU of all the hops until it reaches the destination, at the cost of large connection startup time. Each MTU discovery incurs a partial round-trip cost, so the total round trip time is \( O(n^2) \) where n is the number of hops. This is especially costly for servers that deal with a large number of clients. The server would have to probe and remember each client's path MTU.

What I'm proposing is to get rid of PMTUD altogether but apply the principle of exponential back-off in fragmentation.

Every time a packet has to be fragmented, divide the packet into two more-or-less equally sized halves. Or keep halving the packet \(k\) times until the \( \frac1{2_k} \) sized fragment fits to the next link's MTU. The rational is that we're going to be splitting the packet anyway, and halving gives extra headroom for the next hops MTU which might be smaller.

In the worst case where each hop's MTU is exponentially decreasing, you might still get \(2^N\) times the number of fragments, but in practice MTU is linearly decreasing. The reason for the linear decrease is due to the extra encapsulation header being stacked on linearly.

Revisiting the above scenario, the original packets would be divided into two of 758 bytes, both traverse through all tunnel layers without further fragmentation, and without knowing the path MTU. Alternatively, after a costly PMTUD, the host might divide the original packets to two of 1400 and 116 bytes respectively, achieving the same effect without PMTUD but with the exponential back-off fragmentation.

Friday, May 17, 2013

Pogoplug V4 spec for future reference

NAND partitions:
/ # cat /proc/mtd
dev:    size   erasesize  name
mtd0: 00200000 00020000 "u-boot"
mtd1: 00300000 00020000 "uImage"
mtd2: 00300000 00020000 "uImage2"
mtd3: 00800000 00020000 "failsafe"
mtd4: 07000000 00020000 "root"
PCI and USB devices:
/ # lspci
00:01.0 Class 0c03: 1b73:1009
/ # lsusb
Bus 001 Device 001: ID 1d6b:0002
Bus 002 Device 001: ID 1d6b:0002
Bus 001 Device 002: ID 0781:5571
CPU info:
/ # cat /proc/cpuinfo 
Processor       : Feroceon 88FR131 rev 1 (v5l)
BogoMIPS        : 799.53
Features        : swp half thumb fastmult edsp 
CPU implementer : 0x56
CPU architecture: 5TE
CPU variant     : 0x2
CPU part        : 0x131
CPU revision    : 1

Hardware        : Feroceon-KW
Revision        : 0000
Serial          : 0000000000000000
dmesg output:
~ # dmesg
console=ttyS0,115200 root=ubi0:rootfs ubi.mtd=4,2048 rootfstype=ubifs
<4>[    0.000000] PID hash table entries: 512 (order: 9, 2048 bytes)
<6>[    0.000000] Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
<6>[    0.000000] Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
<6>[    0.000000] Memory: 128MB = 128MB total
<5>[    0.000000] Memory: 118356KB available (3852K code, 261K data, 124K init, 0K highmem)
<6>[    0.000000] Hierarchical RCU implementation.
<6>[    0.000000] NR_IRQS:128
<4>[    0.000000] Console: colour dummy device 80x30
<6>[    0.000000] Calibrating delay loop... 799.53 BogoMIPS (lpj=3997696)
<4>[    0.230000] Mount-cache hash table entries: 512
<6>[    0.230000] CPU: Testing write buffer coherency: ok
<6>[    0.230000] NET: Registered protocol family 16
<6>[    0.230000] Feroceon L2: Enabling L2
<6>[    0.230000] Feroceon L2: Cache support initialised.
<4>[    0.230000] 
<4>[    0.230000] CPU Interface
<4>[    0.230000] -------------
<4>[    0.230000] SDRAM_CS0 ....base 00000000, size 128MB 
<4>[    0.230000] SDRAM_CS1 ....disable
<4>[    0.230000] SDRAM_CS2 ....disable
<4>[    0.230000] SDRAM_CS3 ....disable
<4>[    0.230000] PEX0_MEM ....base e0000000, size 128MB 
<4>[    0.230000] PEX0_IO ....base f2000000, size   1MB 
<4>[    0.230000] PEX1_MEM such
<4>[    0.230000] PEX1_IO such
<4>[    0.230000] INTER_REGS ....base f1000000, size   1MB 
<4>[    0.230000] NFLASH_CS ....base fa000000, size   2MB 
<4>[    0.230000] SPI_CS ....base f4000000, size  16MB 
<4>[    0.230000] BOOT_ROM_CS such
<4>[    0.230000] DEV_BOOTCS such
<4>[    0.230000] CRYPT_ENG ....base f0000000, size   2MB 
<4>[    0.230000] 
<4>[    0.230000]   Marvell Development Board (LSP Version KW_LSP_5.1.3_patch18)-- RD-88F6192A-NAS  Soc: 88F6192 A1 LE
<4>[    0.230000] 
<4>[    0.230000]  Detected Tclk 166666667 and SysClk 200000000 
<4>[    0.230000] Marvell USB EHCI Host controller #0: c403e740
<4>[    0.730000] PEX0 interface detected Link X1
<7>[    0.730000] pci 0000:00:01.0: reg 10 64bit mmio: [0x40000000-0x4000ffff]
<7>[    0.730000] pci 0000:00:01.0: reg 18 64bit mmio: [0x40010000-0x40010fff]
<7>[    0.730000] pci 0000:00:01.0: reg 20 64bit mmio: [0x40011000-0x40011fff]
<7>[    0.740000] pci 0000:00:01.0: supports D1
<6>[    0.740000] pci 0000:00:01.0: PME# supported from D0 D1 D3hot
<6>[    0.740000] pci 0000:00:01.0: PME# disabled
<6>[    0.740000] PCI: bus0: Fast back to back transfers disabled
<4>[    0.740000] mvPexLocalBusNumSet: ERR. Invalid PEX interface 1
<4>[    0.750000] bio: create slab <bio-0> at 0
<5>[    0.750000] SCSI subsystem initialized
<6>[    0.750000] usbcore: registered new interface driver usbfs
<6>[    0.750000] usbcore: registered new interface driver hub
<6>[    0.750000] usbcore: registered new device driver usb
<6>[    0.750000] NET: Registered protocol family 2
<6>[    0.750000] IP route cache hash table entries: 1024 (order: 0, 4096 bytes)
<6>[    0.750000] TCP established hash table entries: 4096 (order: 3, 32768 bytes)
<6>[    0.750000] TCP bind hash table entries: 4096 (order: 2, 16384 bytes)
<6>[    0.750000] TCP: Hash tables configured (established 4096 bind 4096)
<6>[    0.750000] TCP reno registered
<6>[    0.750000] NET: Registered protocol family 1
<6>[    0.750000] cpufreq: Init kirkwood cpufreq driver
<7>[    0.750000] cpufreq: High frequency: 800000KHz - Low frequency: 200000KHz
<7>[    0.750000] cpufreq: Setting CPU Frequency to 800000 KHz
<7>[    0.750000] cpufreq: Setting PowerSaveState to off
<6>[    0.760000] XOR registered 4 channels
<6>[    0.760000] XOR 2nd invalidate WA enabled
<4>[    0.760000] cesadev_init(c000d7fc)
<4>[    0.760000] mvCesaInit: sessions=640, queue=64, pSram=f0000000
<6>[    0.760000] squashfs: version 4.0 (2009/01/31) Phillip Lougher
<6>[    0.770000] msgmni has been set to 231
<6>[    0.770000] alg: No test for cipher_null (cipher_null-generic)
<6>[    0.770000] alg: No test for ecb(cipher_null) (ecb-cipher_null)
<6>[    0.770000] alg: No test for digest_null (digest_null-generic)
<6>[    0.770000] alg: No test for compress_null (compress_null-generic)
<6>[    0.780000] alg: No test for stdrng (krng)
<6>[    0.780000] alg: No test for hmac(digest_null) (hmac(digest_null-generic))
<6>[    0.790000] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
<6>[    0.790000] io scheduler noop registered
<6>[    0.790000] io scheduler anticipatory registered (default)
<4>[    0.790000] Initializing ths8200_init
<4>[    0.790000] Initializing dove_adi9889_init
<6>[    0.810000] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
<6>[    0.810000] serial8250.0: ttyS0 at MMIO 0xf1012000 (irq = 33) is a 16550A
<6>[    0.810000] console [ttyS0] enabled
<4>[    0.820000] Integrated Sata device found
<4>[    0.830000] IRQ 21/mvSata: IRQF_DISABLED is not guaranteed on shared IRQs
<6>[    0.850000] scsi0 : Marvell SCSI to SATA adapter
<6>[    0.860000] scsi1 : Marvell SCSI to SATA adapter
<4>[    0.870000] Loading Marvell Ethernet Driver:
<4>[    0.870000]   o Cached descriptors in DRAM
<4>[    0.880000]   o DRAM SW cache-coherency
<4>[    0.880000]   o 1 Giga ports supported
<4>[    0.880000]   o Single RX Queue support - ETH_DEF_RXQ=0
<4>[    0.890000]   o Single TX Queue support - ETH_DEF_TXQ=0
<4>[    0.890000]   o TCP segmentation offload (TSO) supported
<4>[    0.900000]   o Large Receive offload (LRO) supported
<4>[    0.900000]   o Receive checksum offload supported
<4>[    0.910000]   o Transmit checksum offload supported
<4>[    0.910000]   o Network Fast Processing (Routing) supported - (Disabled)
<4>[    0.920000]   o Driver ERROR statistics enabled
<4>[    0.930000]   o Proc tool API enabled
<4>[    0.930000]   o SKB Reuse supported - (Disabled)
<4>[    0.930000]   o SKB Recycle supported - (Disabled)
<4>[    0.940000]   o Rx descripors: q0=128
<4>[    0.940000]   o Tx descripors: q0=532
<4>[    0.950000]   o Loading network interface(s):
<4>[    0.950000]      o register under mv88fx_eth platform
<4>[    0.960000]      o eth0, ifindex = 2, GbE port = 0
<4>[    0.960000] 
<4>[    0.960000] mvFpRuleDb (c45b2000): 1024 entries, 4096 bytes
<4>[    0.970000] Counter=0, opIdx=6, overhead=16
<4>[    0.970000] Counter=1, opIdx=2, overhead=0
<4>[    0.980000] Counter=2, opIdx=1, overhead=18
<4>[    0.980000] Counter=3, opIdx=2, overhead=0
<6>[    0.990000] tun: Universal TUN/TAP device driver, 1.6
<6>[    0.990000] tun: (C) 1999-2004 Max Krasnyansky <>
<6>[    1.000000] NAND device: Manufacturer ID: 0xad, Chip ID: 0xf1 (Hynix NAND 128MiB 3,3V 8-bit)
<6>[    1.010000] Scanning device for bad blocks
<4>[    1.060000] Using static partition definition
<5>[    1.060000] Creating 5 MTD partitions on "nand_mtd":
<5>[    1.070000] 0x000000000000-0x000000200000 : "u-boot"
<5>[    1.070000] 0x000000200000-0x000000500000 : "uImage"
<5>[    1.080000] 0x000000500000-0x000000800000 : "uImage2"
<5>[    1.080000] 0x000000800000-0x000001000000 : "failsafe"
<5>[    1.090000] 0x000001000000-0x000008000000 : "root"
<5>[    1.100000] UBI: attaching mtd4 to ubi0
<5>[    1.100000] UBI: physical eraseblock size:   131072 bytes (128 KiB)
<5>[    1.110000] UBI: logical eraseblock size:    126976 bytes
<5>[    1.110000] UBI: smallest flash I/O unit:    2048
<5>[    1.120000] UBI: sub-page size:              512
<5>[    1.120000] UBI: VID header offset:          2048 (aligned 2048)
<5>[    1.130000] UBI: data offset:                4096
<5>[    1.350000] UBI: attached mtd4 to ubi0
<5>[    1.350000] UBI: MTD device name:            "root"
<5>[    1.360000] UBI: MTD device size:            112 MiB
<5>[    1.360000] UBI: number of good PEBs:        896
<5>[    1.370000] UBI: number of bad PEBs:         0
<5>[    1.370000] UBI: max. allowed volumes:       128
<5>[    1.380000] UBI: wear-leveling threshold:    4096
<5>[    1.380000] UBI: number of internal volumes: 1
<5>[    1.390000] UBI: number of user volumes:     1
<5>[    1.390000] UBI: available PEBs:             0
<5>[    1.400000] UBI: total number of reserved PEBs: 896
<5>[    1.400000] UBI: number of PEBs reserved for bad PEB handling: 8
<5>[    1.410000] UBI: max/mean erase counter: 1/0
<5>[    1.410000] UBI: image sequence number: 0
<6>[    1.420000] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
<6>[    1.420000] ehci_marvell ehci_marvell.70059: Marvell Orion EHCI
<6>[    1.430000] ehci_marvell ehci_marvell.70059: new USB bus registered, assigned bus number 1
<5>[    1.440000] UBI: background thread "ubi_bgt0d" started, PID 452
<6>[    1.470000] ehci_marvell ehci_marvell.70059: irq 19, io base 0xf1050100
<6>[    1.490000] ehci_marvell ehci_marvell.70059: USB 2.0 started, EHCI 1.00
<6>[    1.490000] usb usb1: configuration #1 chosen from 1 choice
<6>[    1.500000] hub 1-0:1.0: USB hub found
<6>[    1.500000] hub 1-0:1.0: 1 port detected
<6>[    1.510000] xhci_hcd 0000:00:01.0: xHCI Host Controller
<6>[    1.510000] xhci_hcd 0000:00:01.0: new USB bus registered, assigned bus number 2
<6>[    1.520000] xhci_hcd 0000:00:01.0: irq 9, io mem 0xe0000000
<4>[    1.530000] usb usb2: config 1 interface 0 altsetting 0 endpoint 0x81 has no SuperSpeed companion descriptor
<6>[    1.540000] usb usb2: configuration #1 chosen from 1 choice
<7>[    1.540000] xHCI xhci_add_endpoint called for root hub
<7>[    1.540000] xHCI xhci_check_bandwidth called for root hub
<6>[    1.540000] hub 2-0:1.0: USB hub found
<6>[    1.550000] hub 2-0:1.0: 4 ports detected
<6>[    1.550000] Initializing USB Mass Storage driver...
<6>[    1.550000] usbcore: registered new interface driver usb-storage
<6>[    1.560000] USB Mass Storage support registered.
<6>[    1.570000] usbcore: registered new interface driver ums-datafab
<6>[    1.570000] usbcore: registered new interface driver ums-freecom
<6>[    1.580000] usbcore: registered new interface driver ums-jumpshot
<6>[    1.580000] usbcore: registered new interface driver ums-sddr09
<6>[    1.590000] usbcore: registered new interface driver ums-sddr55
<6>[    1.600000] usbcore: registered new interface driver ums-usbat
<6>[    1.600000] mice: PS/2 mouse device common for all mice
<6>[    1.610000] i2c /dev entries driver
<7>[    1.610000] cpufreq: Setting CPU Frequency to 800000 KHz
<7>[    1.610000] cpufreq: Setting PowerSaveState to off
<6>[    1.620000] sdhci: Secure Digital Host Controller Interface driver
<6>[    1.620000] sdhci: Copyright(c) Pierre Ossman
<5>[    1.630000] mmc0: mvsdio driver initialized, using GPIO 27 for card detection
<6>[    1.640000] usbcore: registered new interface driver usbhid
<6>[    1.640000] usbhid: v2.6:USB HID core driver
<6>[    1.650000] TCP cubic registered
<6>[    1.650000] NET: Registered protocol family 17
<6>[    1.660000] RPC: Registered udp transport module.
<6>[    1.660000] RPC: Registered tcp transport module.
<4>[    1.670000] drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
<5>[    1.740000] UBIFS: recovery needed
<5>[    1.820000] UBIFS: recovery completed
<5>[    1.830000] UBIFS: mounted UBI device 0, volume 0, name "rootfs"
<5>[    1.830000] UBIFS: file system size:   110850048 bytes (108252 KiB, 105 MiB, 873 LEBs)
<5>[    1.840000] UBIFS: journal size:       9023488 bytes (8812 KiB, 8 MiB, 72 LEBs)
<5>[    1.850000] UBIFS: media format:       w4/r0 (latest is w4/r0)
<5>[    1.850000] UBIFS: default compressor: lzo
<5>[    1.860000] UBIFS: reserved for root:  0 bytes (0 KiB)
<4>[    1.860000] VFS: Mounted root (ubifs filesystem) on device 0:11.
<6>[    1.870000] Freeing init memory: 124K
<6>[    1.880000] usb 1-1: new high speed USB device using ehci_marvell and address 2
<6>[    2.050000] usb 1-1: configuration #1 chosen from 1 choice
<6>[    2.060000] scsi2 : SCSI emulation for USB Mass Storage devices
<7>[    2.070000] usb-storage: device found at 2
<7>[    2.070000] usb-storage: waiting for device to settle before scanning
<5>[    4.290000] eth0: link down
<5>[    4.290000] eth0: started
<5>[    5.840000] eth0: link up, full duplex, speed 1 Gbps
<5>[    7.070000] scsi 2:0:0:0: Direct-Access     SanDisk  Cruzer Fit       1.26 PQ: 0 ANSI: 5
<5>[    7.090000] sd 2:0:0:0: [sda] 31266816 512-byte logical blocks: (16.0 GB/14.9 GiB)
<5>[    7.100000] sd 2:0:0:0: [sda] Write Protect is off
<7>[    7.100000] sd 2:0:0:0: [sda] Mode Sense: 43 00 00 00
<3>[    7.100000] sd 2:0:0:0: [sda] Assuming drive cache: write through
<3>[    7.110000] sd 2:0:0:0: [sda] Assuming drive cache: write through
<6>[    7.120000]  sda: sda1
<5>[    7.120000] sd 2:0:0:0: Attached scsi generic sg0 type 0
<7>[    7.140000] usb-storage: device scan complete
<3>[    7.150000] sd 2:0:0:0: [sda] Assuming drive cache: write through
<5>[    7.160000] sd 2:0:0:0: [sda] Attached SCSI removable disk
<4>[    7.860000] IRQ 93 uses trigger mode 0; requested 3
<6>[    7.870000] input: gpio-keys as /devices/platform/gpio-keys.0/input/input0
<6>[    7.880000] xce_ebtn: Pogoplug series V4 eject button initialized.
<4>[    8.030000] ufsd: module license 'Commercial product' taints kernel.
<4>[    8.030000] Disabling lock debugging due to kernel taint
<5>[    8.070000] ufsd: driver (8.6 (U86_S[2012-02-28-18:39:23]), LBD=ON, delalloc, ioctl) loaded at bf00c000
<5>[    8.070000] NTFS support included
<5>[    8.070000] Hfs+/HfsX support included
<5>[    8.070000] For 'CloudEngines_PogoPlug_2011-08-03'
<4>[    8.400000] rtusb init rt2870 --->
<6>[    8.410000] usbcore: registered new interface driver rt2870
<4>[    8.440000] Cloud Engines XCE Init [Version:]
<6>[    8.440000] XCE: CPU MEMORY MAP:
<6>[    8.440000] XCE:   -- 0x00001000 - 0xbeffffff (3055 MB)  User Space Mappings
<6>[    8.450000] XCE:   -- 0xbf000000 - 0xbfffffff (  16 MB)  Kernel module space
<6>[    8.460000] XCE:   -- 0xc0000000 - 0xc7ffffff ( 128 MB)  Kernel direct-mapped ram
<6>[    8.470000] XCE:   -- 0xc8800000 - 0xe7ffffff ( 504 MB)  Kernel vmalloc space
<6>[    8.470000] XCE:   -- 0xe8000000 - 0xfeffffff ( 367 MB)  Kernel platform space
<6>[    8.480000] XCE: CPU FEATURES:
<6>[    8.480000] XCE:   -- I Cache:         enabled
<6>[    8.490000] XCE:   -- D Cache:         enabled
<6>[    8.490000] XCE:   -- Branch Predict:  disabled
<6>[    8.500000] XCE:   -- MMU:             enabled
<6>[    8.500000] XCE:   -- Alignment Abort: enabled
<6>[    8.510000] XCE: BLPARAMS:   -- Loading properties [c4d49efc].
<6>[    8.520000] XCE: BLPARAMS:   -- MTD @ [c45c1c00].
<6>[    8.520000] XCE: BLPARAMS: Locating parameter block...
<6>[    8.530000] XCE: BLPARAMS: reading 2048 bytes @ a0000
<6>[    8.530000] XCE: Loaded Property Size: 2048
<6>[    8.540000] XCE:    - 'cesvcid' -> '9C5FE2YZ96Z72G5C8AWKUNJHWW'
<6>[    8.540000] XCE:    - 'ceboardver' -> 'PPV4A3'
<6>[    8.550000] XCE:   -- ICache Prefetch: enabled
<6>[    8.550000] XCE:   -- DCache Prefetch: enabled
<6>[    8.560000] XCE:   -- L2 Cache:        enabled
<6>[    8.560000] XCE:   -- L2 Prefetch:     disabled
<6>[    8.570000] XCE:   -- L2 Writethrough: disabled
<6>[    8.570000] XCE:   -- Write Allocate:  disabled
<6>[    8.580000] XCE:   -- Streaming:       disabled
<6>[    8.580000] XCE: Current GPIO State:
<6>[    8.580000] XCE:  GPIO L OUT:    0x01f18400
<6>[    8.590000] XCE:  GPIO L OE:     0xfe004800
<6>[    8.590000] XCE:  GPIO L BLINK:  0x00000000
<6>[    8.600000] XCE:  GPIO L POL:    0x28000000
<6>[    8.600000] XCE:  GPIO L IN:     0x11f00000
<6>[    8.600000] XCE:  GPIO H OUT:    0x00000008
<6>[    8.610000] XCE:  GPIO H OE:     0x00000005
<6>[    8.610000] XCE:  GPIO H BLINK:  0x00000000
<6>[    8.620000] XCE:  GPIO H POL:    0x00000000
<6>[    8.620000] XCE:  GPIO H IN:     0x00000008
<6>[    8.720000] XCE: BLPARAMS:   -- Loading properties [c4babecc].
<6>[    8.720000] XCE: BLPARAMS:   -- MTD @ [c45c1c00].
<6>[    8.730000] XCE: BLPARAMS: Locating parameter block...
<6>[    8.730000] XCE: BLPARAMS: reading 2048 bytes @ a0000
<6>[    8.740000] XCE: BLPARAMS: reading 2048 bytes @ a0800
<6>[    8.750000] XCE: BLPARAMS: reading 2048 bytes @ a1000
<6>[    8.750000] XCE: BLPARAMS: reading 2048 bytes @ a1800
<6>[   14.370000] XCE: XCE: LED -> CONNECTED
<6>[   14.400000] kjournald starting.  Commit interval 5 seconds
<6>[   14.400000] EXT3 FS on sda1, internal journal
<6>[   14.410000] EXT3-fs: mounted filesystem with writeback data mode.
<6>[   14.730000] usb 1-1: reset high speed USB device using ehci_marvell and address 2

Saturday, May 11, 2013

Installing Debian on Pogoplug Series V4

Here I'm basing the tools and instruction on Arch Linux. I'm using a PogoPlug Series V4 with a SamDisk Cruzer Fit 16GB flash drive which fits snuggly on the top USB 2.0 port, even with the lid closed.

I also followed the instructions by Moustafa Hassan to identify and solder the serial console pins, but chose a different connection method. I bought female-female jumper wires, extra long breakaway headers to couple the jumper wires, and USB to TTL serial cable which works out of box on Linux. Although it is possible to ssh into PogoPlug V4, having serial console access is a safety net in case I brick the Pogo.
  • Select black, green, and white wires. This will be used to make the external connection.
  • Pick a bundle of three other wires. I chose grey, purple and blue because that happen to come off next. Do not disband them. Cut these in two halves together, strip one half and save the other half for future use. Solder grey, purple, blue to GND, RX, TX respectively. You might want to glue them to the board since the solder joints come off easily.
    • Unfortunately you can't solder breakaway header to the board. The holes are not 1mm spaced, and the holes are too small.
  • Use the extra long breakaway to connect grey to black, purple to green, and blue to white. This connection will be inside the Pogo.
  • Carefully extend the black (GND), green (RX), white (TX) wires through the vent of the top cover, cap them with another set of breakaway header. These colors will be the same as the USB-TTL serial cable. Red is not used.
Hint: on Debian, adding your non-root user to the "dialout" group (logout, log back in) will allow you use the command "screen /dev/ttyUSB0 115200" without sudo. To find out exactly which group you need to add to on your distro, ls -l /dev/ttyUSB0 and look at the group of the file.

Installing Debian is significantly more work than Arch Linux, and generally requires another Linux machine to do the first stage bootstrapping.

I was going to use Multistrap to prepare a root filesystem but I changed that plan. Since I already have serial console and can TFTP boot anything I want, I can now TFTP boot and use the interactive Debian Installer. After interactively installing, the plan is to manually configure the bootloader using Arch Linux's U-Boot which has USB support.

To boot the Debian Installer, I need to do this one time setup at the U-Boot prompt to make mainline kernel boot on PogoPlug which is an unrecognized device. I found this information at the Plug Computer Forum.
setenv mainlineLinux yes
setenv arcNumber 2097
Correction: When using arcNumber 2097 (SheevaPlug), USB3 is not working because PCIe is not initialized, but GigE is working. When using arcNumber 1681 (RD-88F6192-NAS), USB3 works but not GigE. For proper Pogoplug V4 support, one has to build a Debian kernel with the archlinuxarm.patch.
Unfortunately, setting mainlineLinux=yes will prevent the Pogo's built-in Linux kernel from booting, so you'll have to set mainlineLinux=no if you want it back.

And then, following Debian installer for plug computer instructions, after downloading the SheevaPlug installer files to TFTP as debian/uImage and debian/uInitrd, I could boot d-i over TFTP like this:
setenv bootargs console=ttyS0,115200n8 base-installer/initramfs-tools/driver-policy=most
tftp 0x00800000 debian/uImage
tftp 0x01100000 debian/uInitrd
bootm 0x00800000 0x01100000
I installed Debian. The installer finished without requring swap space which is impressive considering that there is only 128MB of main memory. However, to get Debian to boot, I had to boot into PogoPlug Linux first and run First boot into U-boot:
setenv mainlinuxLinux no
Boot into PogoPlug Linux, download and run it. But I need to make more adjustments before Debian can boot. Boot into U-Boot again and run:
setenv mainlinuxLinux yes
setenv alarm_usb 'ext2load usb 0:1 0x800000 /boot/uImage; ext2load usb 0:1 0x1100000 /boot/uInitrd; run alarm_which; run alarm_args; bootm 0x800000 0x1100000'
The reason is that ArchLinux boot script only loads the kernel but not the initramfs. If you boot from SATA/IDE, then make the analogous change to alarm_ide also. Debian should now boot normally.

Testing U-Boot over TFTP

Before I flash my PogoPlug v4 with a custom compiled u-boot, I thought it is prudent to be able to test it and make sure it works before I permanently brick the device. Although according to this circuit board breakdown, there are suspected I/O connection to a 5-pin J15 to the Hynix NAND flash chip, it is not clear if I'd be able to flash a bricked device. Better not take the risk.

First setup a tftpd-hpa server. This is just as easy as apt-get install tftpd-hpa and then copy the file to the serving directory, which is /srv/tftp by default. Make sure everyone can read the file copied there. You don't need to configure DHCP to hand out the tftp server's IP address just for testing.

This is what I did to compile my U-Boot. This is based on notes from Always Innovating but using a different configuration.
  • Install emdebian cross compilation toolchain which installs /usr/bin/arm-linux-gnueabi-{gcc,as,ld,...}.
  • git clone git://
  • cd u-boot && git checkout v2013.04  # use a later release if appropriate.
    • To enable ext4, you might want to apply this patch.
  • make CROSS_COMPILE=arm-linux-gnueabi- sheevaplug_config
  • make CROSS_COMPILE=arm-linux-gnueabi- all
This creates a u-boot.bin in the u-boot source directory. Before I continue, I just want to say that sheevaplug_config doesn't work on pogoplug v4. It's worth finding out. But before that let's do a comparison with what works: the Pogoplug V4 opensourced U-Boot at 1.1.4.
tar jxf pogomv-u-boot.tar.bz2
cd pogomv-u-boot
make CROSS_COMPILE=arm-linux-gnueabi- pogoplug2_config
make CROSS_COMPILE=arm-linux-gnueabi- all
cp u-boot-pogoplug2.bin /srv/tftp
Over serial console, power on the device and hit enter as soon as you see:
U-Boot 1.1.4 (Oct  1 2011 - 12:06:06) Cloud Engines 1.1.2 (3.4.27) PHYADDR=0

U-Boot code: 00600000 -> 0067FFF0  BSS: -> 006918B4
Soc: 88F6192 A1 (DDR2)
CPU running @ 800Mhz L2 running @ 400Mhz
SysClock = 200Mhz , TClock = 166Mhz
DRAM CAS Latency = 3 tRP = 3 tRAS = 8 tRCD=3
DRAM CS[0] base 0x00000000   size 128MB
DRAM Total size 128MB  16bit width
Addresses 8M - 0M are saved for the U-Boot usage.
Mem malloc Initialization (8M - 7M): Done
Flash:  0 kB
CPU : Marvell Feroceon (Rev 1)
Streaming disabled
Write allocate disabled
USB 0: host mode
PEX 0: PCI Express Root Complex Interface
PEX interface detected Link X1
Net:   egiga0 [PRIME]
Hit any key to stop autoboot:  0 Do it now!
This enters the device's factory default U-Boot prompt (not the one you just compiled).

Let's get the device an IP address. You'll need to hit Ctrl-C to abort loading because my DHCP doesn't give out the correct TFTP server's address. It's using (gateway) as the TFTP but my TFTP is at
CE>> dhcp
BOOTP broadcast 1
*** Unhandled DHCP Option in OFFER/ACK: 28
*** Unhandled DHCP Option in OFFER/ACK: 28
DHCP client bound to address
*** Warning: no boot file name; using '0A0005CE.img'
Using egiga0 device
TFTP from server; our IP address is
Filename '0A0005CE.img'.
Load address: 0x2000000
Loading: * # hit Ctrl-C here
At this point your device will forget its IP address, but you can set it back using setenv. Essentially we're using the DHCP command just to get an IP address. Here I'm also setting the correct TFTP server's IP address I want to be using.
CE>> setenv ipaddr
CE>> setenv serverip
And now we can TFTP. We need to specify the filename and the memory location it will be written to once downloaded. The convention is to always chainload to 0x800000. The chainloaded U-Boot will relocate itself back to 0x67FFF0 which is the default CONFIG_SYS_TEXT_BASE.
CE>> tftp 0x800000 u-boot-pogoplug2.bin
Using egiga0 device
TFTP from server; our IP address is
Filename 'u-boot-pogoplug2.bin'.
Load address: 0x800000
Loading: #################################################################
Bytes transferred = 474148 (73c24 hex)
When we start executing the chainloaded U-Boot, it will appear as if the device has reset itself, but it's really our newly compiled binary (see the U-Boot compilation date).
CE>> go 0x800000
## Starting application at 0x00800000 ...

U-Boot 1.1.4 (May 11 2013 - 18:53:25) Cloud Engines 1.1.2 (3.4.27) PHYADDR=0

U-Boot code: 00600000 -> 0067FFF0  BSS: -> 006918AC

Soc: 88F6192 A1 (DDR2)
CPU running @ 800Mhz L2 running @ 400Mhz
SysClock = 200Mhz , TClock = 166Mhz 

DRAM CAS Latency = 3 tRP = 3 tRAS = 8 tRCD=3
DRAM CS[0] base 0x00000000   size 128MB
DRAM Total size 128MB  16bit width
Addresses 8M - 0M are saved for the U-Boot usage.
Mem malloc Initialization (8M - 7M): Done
Flash:  0 kB

CPU : Marvell Feroceon (Rev 1)

Streaming disabled
Write allocate disabled

USB 0: host mode
PEX 0: PCI Express Root Complex Interface
PEX interface detected Link X1
Net:   egiga0 [PRIME]
Hit any key to stop autoboot:  0 do it now!
Don't forget to hit any key to stop at this point! And then after setenv serverip and ipaddr again. We can now tftp load the u-boot.bin we compiled for sheevaplug_config.
CE>> setenv serverip
CE>> setenv ipaddr
CE>> tftp 0x800000 sheevaplug.bin
Using egiga0 device
TFTP from server; our IP address is
Filename 'sheevaplug.bin'.
Load address: 0x800000
Loading: #################################################################
Bytes transferred = 397856 (61220 hex)
CE>> go 0x800000
## Starting application at 0x00800000 ...
data abort
pc : [<008000e8>]          lr : [<00800064>]
sp : 005ffe10  ip : 006554f9     fp : 00000000
r10: 00000000  r9 : 00673674     r8 : 005fffcc
r7 : 00000000  r6 : 00800000     r5 : 00721c24  r4 : 00000002
r3 : f1012000  r2 : 00000020     r1 : e1a03803  r0 : 00625131
Flags: nzcv  IRQs off  FIQs off  Mode SVC_32
Resetting CPU ...
This proves that Pogoplug V4 doesn't like sheevaplug's configuration. I've tested Davy Gravy's patch for pogo_v4_config but couldn't get it to work. I did manage to get the u-boot from ArchLinux's PogoPlug v4 u-boot to work.

Upon closer inspection, I noticed a few things about ArchLinux's U-boot.

  • It's probably based on Pogoplug V4's u-boot at 1.1.4, except with USB enabled somehow.
  • The script writes ArchLinux u-boot to /dev/mtd0 for 4 blocks at 1MB offset, avoid overwriting the default u-boot.
  • The boot script distinguishes factory u-boot from ArchLinux by probing for USB support (this is what "if usb start" is for). The default u-boot has no USB.
I also looked at porting PogoPlug v4 U-boot to a more recent version of U-boot but I gave up. The diff from stock U-boot 1.1.4 to PogoPlug v4 U-boot 1.1.4 reveals that it added a bunch of general purpose Marvell device support. It even included and compiled several files from the Linux kernel such as mv_cesa.c which has no business in U-boot. It's pretty messy.

I also tried the Marvell custodian repo of U-boot without luck. I think ArchLinux's approach is probably as good as it gets ... take the PogoPlug v4 U-boot 1.1.4 and enable USB.

Saturday, April 27, 2013

Using Git for Making Law

The other day I was reading the transcript of the Constitution of the United States, and I noticed annotations in the body whenever a paragraph is changed by an amendment. In a sense, legal text has developed a process of "patching" similar to how programmers manage source code of a software system. A "patch" in the revision control sense is a precise editing instruction for insertion and deletion of text, and that allows a reviewer to easily tell exactly what has been changed.

There are many revision control systems, but I'm only going to write about this one called Git because that's the one I know more about. Git is a particularly interesting revision control system in that it enjoys the following properties.
  • The repository is distributed. A complete revision history is replicated among people who choose to clone a repository. This includes all objects that are required to reconstruct a given snapshot some point in the history.
  • Revision is structured as a tree-like lineage of the SHA1 cryptographic digest which serves the purpose of integrity checking. That means any modification to the history will result in a different lineage because the digest will be different. Forging SHA1 digest is a computationally expensive process that requires brute-forcing at least \( 2^{61} \) possibilities to find a collision.
  • Each revision, called a "commit", identifies authorship of the change. Furthermore, a commit can be signed using asymmetric cryptography. This allows a person other than the author of the changes to endorse a change. The endorsement is also computationally hard to forge and can be easily verified by a computer.
It turns out that Joel C. Salomon had attempted to represent the US constitution as a Git repository as an exercise, after a proposal on Hacker News. I think this exercise would become even more interesting if he had incorporated cryptographic signing to his approach.

When using asymmetric cryptography, a pair of keys are made that are mathematically associated, the public key and the secret (private) key. For encryption, the public key is used to encrypt, and the secret key is used to decrypt. For signing, the secret key is used to sign, and the public key is used to verify the signature. The public key is what everyone else has to trust, while the secret key most be kept safely guarded or the trust would be breached.

In effect, using Git, we can develop a totally democratic legal system where:
  • Anyone can clone the text of law and make modifications to their own drafts. They can also see how other drafts are formulated over time and who wrote what section.
  • The complete lineage of legal system development will be of interest to legal historians.
  • Only revisions that are signed by a trusted public key will become legally effective.
Of course, this brings the question of how to establish trust. Normal individuals can use a Web of Trust system where a group will endorse each other's public key for trustworthiness. People could organize a key-signing party where they will meet each other in person and endorse the public keys of the people they met. For the purpose of public affair, it's probably easier to organize trust by certificate authority. However, regardless of how trust is established, the public key is only as trustworthy as the manner how the secret key is guarded.

As a corollary, if a secret key is stored on a computer, then the public key is only as trustworthy as the computer that stores the secret key. We all know how computers are notoriously difficult to secure. It then makes sense to have a limited-purpose computer to store secret keys, and reduce the problem to physical security of that computer. Although secret keys can also be encrypted with a passphrase, brute-forcing the passphrase is still much easier than decrypting a message or forging a signature without the secret key.

And then I found out about smart cards that are essentially the limited-purpose computer. Git uses GnuPG for revision signing, and there are two main card types that can be used with GnuPG. One is the commercially available PKCS#15 cards supported by OpenSC (via gnupg-pkcs11), and the other is OpenPGP card developed by g10code. Gooze gives certain free software developers a free Feitian PKI smart card which is a PKCS#15 card. Free Software Foundation Europe gives every fellowship member an Fellowship Smartcard which is a branded OpenPGP card. You can also buy OpenPGP cards from Kernel Concepts in Germany.

These cryptography smart cards are interesting because they can generate the public and secret key pair on the card, and the secret key never leaves the card. The signing and decryption happens on the card itself. The card is protected by a passphrase similar to how secret key is protected. In addition, the card will only allow a limit number of tries before locking itself out, so it improves the physical security.

However, if the secret key never leaves the smart card, then losing the smart card will lose the secret key which is a big deal because the web of trust has to be established from scratch again. To deal with key management problems, some people create several key pairs, a master key pair used for signing only, and several subkeys, one for encryption, one for signing, and one for authentication. Trust is established for the master key, which is used to endorse the subkeys.

Note that FSFE does not recommend using the smart card for the master key, but only for subkeys. I'm not sure what their rationale is, but I think it should be the other way around.

First create a master key pair on the primary smart card. Use it to endorse subkeys. The subkeys can be stored on a relatively insecure computer for daily use, or on a secondary smart card that you carry around. Keep the master key smart card in a safe place like a bank vault. Only use the master key when revoking an existing subkey, endorsing new subkeys, or endorsing other people's master public key.

The technology to revolutionize the editorial process of law making is readily available and in practical use. Git can be used to track revision lineage, and asymmetric cryptography can be used to given authority on a particular revision. Practical means to establish and secure trust is available in the form of a smart card, and there are effective means of key management that minimizes the chance of losing a key. It is finally time for law making which existed since the beginning of written history to enter the informational age!

Saturday, April 13, 2013

Notes about Interprocess Communication


There are many distributed systems and IPC designs, but forget about them for a moment. Let's design an IPC from scratch. The basic idea is to take a procedure call like this:
char buf[64];
int len = snprintf(buf, sizeof(buf), "The answer is %d", 42);
And take snprintf() out transparently to a separate process. One reason for doing that is that we may have a buggy snprintf() implementation which crashes 1 out of 40 times, and we don't want snprintf() to crash our program. By taking snprintf() out to a separate process, we now also have the liberty to retry when that buggy code crashes. Process isolation means that the buggy code cannot corrupt our memory. We are now fault isolated.

Why do we want fault isolation? Let's say I program in ATS, and I proved that my program can never crash. But I need to use third-party libraries which are only as reliable as (fill in a witty expletive here), and that makes my program just as reliable.

Message passing

Before we can call snprintf() in an isolated process, the runtime system needs to answer a few questions.
  • Discovery: where do I find a process that implements snprintf()?
  • Message passing:
    • How do I request that snprintf() be called?
    • How do I retrieve the result from snprintf() when it is done?
For discovery, there are several ways:
  • Start the process as a child process yourself, and become its babysitter so that you'd restart it when it crashes.
  • Share this process with other processes, and let a system-wide babysitter handle program crashes.
    • This shared process might be on a different computer across the network. You use a name service to find it.
For message passing, first you need a transport, then a wire format. For the transport, you'd probably just open a socket. Local processes can use Unix domain socket, and remote processes can use TCP/IP socket. You can also use UDP if you want to handle potential packet loss yourself. You can also use TLS sockets if you want authentication and encryption. The transport might provide information like remote address and credentials for the purpose of authorization. Authorization is done by the program in order to decide whether to accept or reject the message.

The wire format consists of a frame and serialization of messages. You need a frame so the receiver knows how many bytes to expect. Framing might not be necessary for a datagram socket that already handles frames as part of the underlying protocol. Framing also might not be necessary if the underlying message serialization has an unambiguous length. For the frame, it suffices to just have a message length followed by that number of message bytes. You can use a varint to make the length flexible. If you use a fixed size integer, you're also deciding the effective maximum message size at this point. The message itself would be a byte sequence that serializes detail of the message.

For the purpose of IPC, the request message encodes the function name and arguments, and the response message encodes the return value. In the snprintf() case, the request message could be just ["snprintf", address_of(buf), 64, "The answer is %d", 42], and the response message might be [16].

Out of band communication

Here is an interesting problem. If snprintf() is now in a separate memory isolated process, how could it modify our buffer? Some IPC design forces snprintf() to return the resulting string as part of the response message. This might work okay if the result size is small. If the result consists of mostly large binary array, e.g. a pixel buffer, then the overhead to encode, copy over transport, and decode would be too high. In this case, it is better to use message passing as the "control channel" and use an optimized out of band communication method to deliver large binary objects.

For example, if two processes are on the same machine, then the out of band communication could take place over shared memory. If two processes are on different machines, then they could use remote DMA.

In the case of snprintf, we might have to modify the program like this:
blob_t *buf = allocate_blob(64);
int len = snprintf(buf, sizeof_blob(buf), "The answer is %d", 42);
/* do something with buf */
The "blob" indirection can be inserted by the compiler transparently.


IPC calls can be annotated with Cilk-like spawn and sync so that they may take place in parallel. Indeed, a parallel program is I/O bound when it's waiting for results from a different process.

Other optimizations

  • Amortize the cost of transport establishment by reusing existing connections across function calls.
  • Amortize the cost of out of band communication establishment across the creation and modification of blobs.
  • Channel compression.
  • Scheduling.


This is just an overview of an interprocess communication designed from scratch. Distributed computing has been heavily researched. I took inspiration from some existing designs but used the motivating example as a guideline to eliminate designs that are irrelevant.

Monday, April 1, 2013

Making of a bookmarklet to rot13 a web page

Today Slashdot is annoyingly encoded as rot13. Despite a free preview, I want to see all summaries at once like before.

Here is a piece of Javascript that traverses the DOM tree of a document and applies rot13 to all text nodes. This primitive version actually replaces the text within script and style elements as well, but I'm too lazy to filter them out. Update (2013/4/29): added the ability to filter script and style.
(function() {
  var skipTag = {'SCRIPT': 1, 'STYLE': 1}

  function applyTextNode(node, f) {
    if (node.nodeType == node.ELEMENT_NODE && node.tagName in skipTag)
    if (node.nodeType == node.TEXT_NODE)
      return f(node);
    for (var curr = node.firstChild; curr; curr = curr.nextSibling)
      applyTextNode(curr, f);

  function rot13(c) {
    var i = c.charCodeAt(0);
    if (i >= 65 && i <= 90) {
      i = ((i - 65) + 13) % 26 + 65;
    } else if (i >= 97 && i <= 122) {
      i = ((i - 97) + 13) % 26 + 97;
    return String.fromCharCode(i);

  applyTextNode(document.documentElement, function (node) { =[A-Za-z]/g, rot13);
And here is the "UglifyJS" version.
(function(){function b(c,d){if(!(c.nodeType==c.ELEMENT_NODE&&c.tagName in a)){if(c.nodeType==c.TEXT_NODE)return d(c);for(var e=c.firstChild;e;e=e.nextSibling)b(e,d)}}function c(a){var b=a.charCodeAt(0);return b>=65&&90>=b?b=(b-65+13)%26+65:b>=97&&122>=b&&(b=(b-97+13)%26+97),String.fromCharCode(b)}var a={SCRIPT:1,STYLE:1};b(document.documentElement,function(a){[A-Za-z]/g,c)})})();
And finally, to make it a bookmarklet, we need to encodeURI() and prepend the javascript: scheme, which gives:

Saturday, March 30, 2013

Improved binary JSON encoding

I like MsgPack because its semantics correspond well with JSON, and the encoding is extremely compact. One reason of the compactness is that it encodes small positive and negative integers as is, and arrays, maps, and strings of small sizes have dedicated tags.

While the "fixed" representation works well, MsgPack lacks the representation to stream large amounts of data. Streaming requires the encoding to be chunked, which MsgPack doesn't support. Chunking of string is done through string groups where string pieces within the string group markers are considered a single string. Similar group markers also exist for array and map.

Here is a MsgPack inspired wire format that allows chunking. The following list describes the encoding of an object.
  • Tags 0x00 through 0x7F : positive fixnum 0 through 127.
  • Tags 0x80 through 0x9F : short string or raw bytes of lengths 0 through 31, not chunked.
    • Followed by n bytes of the string data.
  • Tags 0xA0 through 0xA9 reassigned (edited: 11/13/2013)
    • Tags 0xA0 through 0xA7 : fixed array of lengths 0 through 7 respectively.
      • Followed by n objects.
    • Tag 0xA8 : end of string data. This is optional and implied by any tag that is not 0xA9.
    • Tag 0xA9 : start or continuation of a string or raw bytes chunk.
      • Followed by chunk length encoded as a varint. (edited: 9/27/2013)
      • Followed by an object parseable as an integer, interpreted as chunk length n.
      • Followed by n bytes of string data.
  • Tags 0xA0 through 0xA5: (reserved)
  • Tag 0xA6: big string. (new: 11/13/2013)
    • Followed by an object parseable as an integer for the string length.
    • Followed by n bytes of string data.
  • Tag 0xA7: packed numeric array. (new: 11/13/2013)
    • Followed by an object parseable as an integer n, for the total number of bytes of the data.
    • Followed by an object parseable as an integer k, for the type specifier.
      • Big endian integers:
        • 0: uint8, 1: uint16, 2: uint32, 3: uint64. (unsigned)
        • 4: int8, 5: int16, 6: int32, 7: int64. (signed)
      • Little endian integers:
        • 8: uint8, 9: uint16, 10: uint32, 11: uint64. (unsigned)
        • 12: int8, 13: int16, 14: int32, 15: int64. (signed)
      • Big endian IEEE 754 floats.
        • 16: half, 17: single, 18: double, 19: quad
      • Little endian IEEE 754 floats.
        • 20: half, 21: single, 22: double, 23: quad
      • TBD: unicode? pixel format (packed numeric tuples)?
    • Note that the number of items can be obtained by dividing n by the size of each item implied by k.
    • Followed by a short string object of lengths 0 through 7 (actual object length is 1 through 8). This string is only used for padding and is ignored.
    • Followed by n bytes of actual data.
  • Tag 0xA8: string group begin. (new: 11/13/2013)
    • The strings that follow until "string group end" will be concatenated into a single string.
  • Tag 0xA9: string group end. (new: 11/13/2013)
  • Tag 0xAA: array group begin.
    • The objects that follow until "array group end" are items of the array.
  • Tag 0xAB: array group end.
  • Tag 0xAC: map group begin.
    • The objects that follow until "map group end" are grouped into key-value pairs of the map.
  • Tag 0xAD: map group end.
  • Tags 0xAE and 0xAF : (reserved) (edited: 9/28/2013).
  • Tags 0xAE and 0xAF: big-endian and little-endian struct with edit map. See Unified Binary JSON and Cap'n Proto Wire Format.
  • Tag 0xB0 : null.
  • Tag 0xB1 : abstract data type (new: 9/27/2013).
    • Followed by an object representing the constructor name which is typically a string.
    • Followed by an object representing the value, which can be any parseable object as the argument to the constructor.
  • Tag 0xB2 : boolean false.
  • Tag 0xB3 : boolean true.
  • Tag 0xB4 : unsigned int32.
    • Followed by a big-endian 32-bit unsigned integer.
  • Tag 0xB5 : signed int32.
    • Followed by a big-endian 32-bit signed integer.
  • Tag 0xB6 : unsigned int64.
    • Followed by a big-endian 64-bit unsigned integer.
  • Tag 0xB7 : signed int64.
    • Followed by a big-endian 64-bit signed integer.
  • Tags 0xB8 through 0xBB: (reserved)
  • Tag 0xBC : float.
    • Followed by IEEE 754 32-bit single precision float in big-endian.
  • Tag 0xBD : double.
    • Followed by IEEE 754 64-bit double precision float in big-endian
  • Tag 0xBE : varint base128.
    • Followed by a varint.
  • Tag 0xBF : signed varint base128 (zigzag).
    • Followed by a signed zigzag varint.
  • Tags 0xC0 through 0xFF: negative fixnum -64 through -1.
There are fewer tags needed than MsgPack, so we can extend negative fixnum to -64.

Strings are 8-bit clean with an unspecified encoding. Most applications should use UTF-8.

Both array and map allow nesting. As long as the begin and end markers are properly nested, the parser does not care how many items there are.

Integers larger than 128 are better encoded using varint, unless it is determined that 32-bit or 64-bit encoding is more efficient.

If a map is used to serialize a protocol buffer message with non-packed repeatable fields, the key-value pairs are simply repeated with multiple occurrences of the values, without using an array. In language bindings that do not recognize repeated key-value pairs in a map, the last key-value will overwrite previous occurrences.