Life of a Computer Scientist: April 2013

Saturday, April 27, 2013

Using Git for Making Law

The other day I was reading the transcript of the Constitution of the United States, and I noticed annotations in the body whenever a paragraph is changed by an amendment. In a sense, legal text has developed a process of "patching" similar to how programmers manage source code of a software system. A "patch" in the revision control sense is a precise editing instruction for insertion and deletion of text, and that allows a reviewer to easily tell exactly what has been changed.

There are many revision control systems, but I'm only going to write about this one called Git because that's the one I know more about. Git is a particularly interesting revision control system in that it enjoys the following properties.

The repository is distributed. A complete revision history is replicated among people who choose to clone a repository. This includes all objects that are required to reconstruct a given snapshot some point in the history.
Revision is structured as a tree-like lineage of the SHA1 cryptographic digest which serves the purpose of integrity checking. That means any modification to the history will result in a different lineage because the digest will be different. Forging SHA1 digest is a computationally expensive process that requires brute-forcing at least \( 2^{61} \) possibilities to find a collision.
Each revision, called a "commit", identifies authorship of the change. Furthermore, a commit can be signed using asymmetric cryptography. This allows a person other than the author of the changes to endorse a change. The endorsement is also computationally hard to forge and can be easily verified by a computer.

It turns out that Joel C. Salomon had attempted to represent the US constitution as a Git repository as an exercise, after a proposal on Hacker News. I think this exercise would become even more interesting if he had incorporated cryptographic signing to his approach.

When using asymmetric cryptography, a pair of keys are made that are mathematically associated, the public key and the secret (private) key. For encryption, the public key is used to encrypt, and the secret key is used to decrypt. For signing, the secret key is used to sign, and the public key is used to verify the signature. The public key is what everyone else has to trust, while the secret key most be kept safely guarded or the trust would be breached.

In effect, using Git, we can develop a totally democratic legal system where:

Anyone can clone the text of law and make modifications to their own drafts. They can also see how other drafts are formulated over time and who wrote what section.
The complete lineage of legal system development will be of interest to legal historians.
Only revisions that are signed by a trusted public key will become legally effective.

Of course, this brings the question of how to establish trust. Normal individuals can use a Web of Trust system where a group will endorse each other's public key for trustworthiness. People could organize a key-signing party where they will meet each other in person and endorse the public keys of the people they met. For the purpose of public affair, it's probably easier to organize trust by certificate authority. However, regardless of how trust is established, the public key is only as trustworthy as the manner how the secret key is guarded.

As a corollary, if a secret key is stored on a computer, then the public key is only as trustworthy as the computer that stores the secret key. We all know how computers are notoriously difficult to secure. It then makes sense to have a limited-purpose computer to store secret keys, and reduce the problem to physical security of that computer. Although secret keys can also be encrypted with a passphrase, brute-forcing the passphrase is still much easier than decrypting a message or forging a signature without the secret key.

And then I found out about smart cards that are essentially the limited-purpose computer. Git uses GnuPG for revision signing, and there are two main card types that can be used with GnuPG. One is the commercially available PKCS#15 cards supported by OpenSC (via gnupg-pkcs11), and the other is OpenPGP card developed by g10code. Gooze gives certain free software developers a free Feitian PKI smart card which is a PKCS#15 card. Free Software Foundation Europe gives every fellowship member an Fellowship Smartcard which is a branded OpenPGP card. You can also buy OpenPGP cards from Kernel Concepts in Germany.

These cryptography smart cards are interesting because they can generate the public and secret key pair on the card, and the secret key never leaves the card. The signing and decryption happens on the card itself. The card is protected by a passphrase similar to how secret key is protected. In addition, the card will only allow a limit number of tries before locking itself out, so it improves the physical security.

However, if the secret key never leaves the smart card, then losing the smart card will lose the secret key which is a big deal because the web of trust has to be established from scratch again. To deal with key management problems, some people create several key pairs, a master key pair used for signing only, and several subkeys, one for encryption, one for signing, and one for authentication. Trust is established for the master key, which is used to endorse the subkeys.

Note that FSFE does not recommend using the smart card for the master key, but only for subkeys. I'm not sure what their rationale is, but I think it should be the other way around.

First create a master key pair on the primary smart card. Use it to endorse subkeys. The subkeys can be stored on a relatively insecure computer for daily use, or on a secondary smart card that you carry around. Keep the master key smart card in a safe place like a bank vault. Only use the master key when revoking an existing subkey, endorsing new subkeys, or endorsing other people's master public key.

The technology to revolutionize the editorial process of law making is readily available and in practical use. Git can be used to track revision lineage, and asymmetric cryptography can be used to given authority on a particular revision. Practical means to establish and secure trust is available in the form of a smart card, and there are effective means of key management that minimizes the chance of losing a key. It is finally time for law making which existed since the beginning of written history to enter the informational age!

Saturday, April 13, 2013

Notes about Interprocess Communication

Motivation

There are many distributed systems and IPC designs, but forget about them for a moment. Let's design an IPC from scratch. The basic idea is to take a procedure call like this:

char buf[64];
int len = snprintf(buf, sizeof(buf), "The answer is %d", 42);

And take snprintf() out transparently to a separate process. One reason for doing that is that we may have a buggy snprintf() implementation which crashes 1 out of 40 times, and we don't want snprintf() to crash our program. By taking snprintf() out to a separate process, we now also have the liberty to retry when that buggy code crashes. Process isolation means that the buggy code cannot corrupt our memory. We are now fault isolated.

Why do we want fault isolation? Let's say I program in ATS, and I proved that my program can never crash. But I need to use third-party libraries which are only as reliable as (fill in a witty expletive here), and that makes my program just as reliable.

Message passing

Before we can call snprintf() in an isolated process, the runtime system needs to answer a few questions.

Discovery: where do I find a process that implements snprintf()?
Message passing:

How do I request that snprintf() be called?
How do I retrieve the result from snprintf() when it is done?

For discovery, there are several ways:

Start the process as a child process yourself, and become its babysitter so that you'd restart it when it crashes.
Share this process with other processes, and let a system-wide babysitter handle program crashes.

This shared process might be on a different computer across the network. You use a name service to find it.

For message passing, first you need a transport, then a wire format. For the transport, you'd probably just open a socket. Local processes can use Unix domain socket, and remote processes can use TCP/IP socket. You can also use UDP if you want to handle potential packet loss yourself. You can also use TLS sockets if you want authentication and encryption. The transport might provide information like remote address and credentials for the purpose of authorization. Authorization is done by the program in order to decide whether to accept or reject the message.

The wire format consists of a frame and serialization of messages. You need a frame so the receiver knows how many bytes to expect. Framing might not be necessary for a datagram socket that already handles frames as part of the underlying protocol. Framing also might not be necessary if the underlying message serialization has an unambiguous length. For the frame, it suffices to just have a message length followed by that number of message bytes. You can use a varint to make the length flexible. If you use a fixed size integer, you're also deciding the effective maximum message size at this point. The message itself would be a byte sequence that serializes detail of the message.

For the purpose of IPC, the request message encodes the function name and arguments, and the response message encodes the return value. In the snprintf() case, the request message could be just ["snprintf", address_of(buf), 64, "The answer is %d", 42], and the response message might be [16].

Out of band communication

Here is an interesting problem. If snprintf() is now in a separate memory isolated process, how could it modify our buffer? Some IPC design forces snprintf() to return the resulting string as part of the response message. This might work okay if the result size is small. If the result consists of mostly large binary array, e.g. a pixel buffer, then the overhead to encode, copy over transport, and decode would be too high. In this case, it is better to use message passing as the "control channel" and use an optimized out of band communication method to deliver large binary objects.

For example, if two processes are on the same machine, then the out of band communication could take place over shared memory. If two processes are on different machines, then they could use remote DMA.

In the case of snprintf, we might have to modify the program like this:

blob_t *buf = allocate_blob(64);
int len = snprintf(buf, sizeof_blob(buf), "The answer is %d", 42);
/* do something with buf */
free_blob(buf);

The "blob" indirection can be inserted by the compiler transparently.

Parallelism

IPC calls can be annotated with Cilk-like spawn and sync so that they may take place in parallel. Indeed, a parallel program is I/O bound when it's waiting for results from a different process.

Other optimizations

Amortize the cost of transport establishment by reusing existing connections across function calls.
Amortize the cost of out of band communication establishment across the creation and modification of blobs.
Channel compression.
Scheduling.

Conclusion

This is just an overview of an interprocess communication designed from scratch. Distributed computing has been heavily researched. I took inspiration from some existing designs but used the motivating example as a guideline to eliminate designs that are irrelevant.

Monday, April 1, 2013

Making of a bookmarklet to rot13 a web page

Today Slashdot is annoyingly encoded as rot13. Despite a free preview, I want to see all summaries at once like before.

Here is a piece of Javascript that traverses the DOM tree of a document and applies rot13 to all text nodes. ~~This primitive version actually replaces the text within script and style elements as well, but I'm too lazy to filter them out.~~ Update (2013/4/29): added the ability to filter script and style.

(function() {
  var skipTag = {'SCRIPT': 1, 'STYLE': 1}

  function applyTextNode(node, f) {
    if (node.nodeType == node.ELEMENT_NODE && node.tagName in skipTag)
      return;
    if (node.nodeType == node.TEXT_NODE)
      return f(node);
    for (var curr = node.firstChild; curr; curr = curr.nextSibling)
      applyTextNode(curr, f);
  }

  function rot13(c) {
    var i = c.charCodeAt(0);
    if (i >= 65 && i <= 90) {
      i = ((i - 65) + 13) % 26 + 65;
    } else if (i >= 97 && i <= 122) {
      i = ((i - 97) + 13) % 26 + 97;
    }
    return String.fromCharCode(i);
  }

  applyTextNode(document.documentElement, function (node) {
    node.data = node.data.replace(/[A-Za-z]/g, rot13);
  });
})();

And here is the "UglifyJS" version.

(function(){function b(c,d){if(!(c.nodeType==c.ELEMENT_NODE&&c.tagName in a)){if(c.nodeType==c.TEXT_NODE)return d(c);for(var e=c.firstChild;e;e=e.nextSibling)b(e,d)}}function c(a){var b=a.charCodeAt(0);return b>=65&&90>=b?b=(b-65+13)%26+65:b>=97&&122>=b&&(b=(b-97+13)%26+97),String.fromCharCode(b)}var a={SCRIPT:1,STYLE:1};b(document.documentElement,function(a){a.data=a.data.replace(/[A-Za-z]/g,c)})})();

And finally, to make it a bookmarklet, we need to encodeURI() and prepend the javascript: scheme, which gives:

javascript:(function()%7Bfunction%20b(c,d)%7Bif(!(c.nodeType==c.ELEMENT_NODE&&c.tagName%20in%20a))%7Bif(c.nodeType==c.TEXT_NODE)return%20d(c);for(var%20e=c.firstChild;e;e=e.nextSibling)b(e,d)%7D%7Dfunction%20c(a)%7Bvar%20b=a.charCodeAt(0);return%20b%3E=65&&90%3E=b?b=(b-65+13)%2526+65:b%3E=97&&122%3E=b&&(b=(b-97+13)%2526+97),String.fromCharCode(b)%7Dvar%20a=%7BSCRIPT:1,STYLE:1%7D;b(document.documentElement,function(a)%7Ba.data=a.data.replace(/%5BA-Za-z%5D/g,c)%7D)%7D)();

Enjoy!