Saturday, May 25, 2013

Cap'n proto inspired encoding

I guess the fact that Kenton Varda, a principle author of Protocol Buffer and the guy who built the ultimate LAN party house, came up with Cap'n Proto specifically to address the problem of serialization and deserialization overhead, is worthy of an investigation. The idea is to have a wire format that can be dumped to some memory location and used directly without decoding. But the encoding of Cap'n Proto is too complex for me to understand, so I decided to come up with my own.

First I like the idea that each field within a struct is 64-bit word aligned. Large objects like nested structs, strings, and arrays are referenced by an offset. In order to support offsetting and chunking, the wire protocol divides data to segments. When you construct a struct, you allocate memory from the segment. This is where I think it got too complicated.

Rather, let me propose a simpler wire format. A struct consists of uniquely numbered fields. Each field is exactly 64-bit. Plain old integral data will be stored as is. They can be packed, e.g. four 16-bit integers or two 32-bit integers in a field. The 64-bit field size allows storing a pointer on a 64-bit machine. This pointer is a machine and implementation specific object pointer, and is meaningless when serialized. Here is where my proposal differs from Cap'n Proto.

Immediately after each struct are a series of "fix-ups" or "field population instructions" that are basically a sequence of (field number, data type, data length, data) tuples. The field number specifies the field of the struct to be populated. Data type specifies how to populate this field with data. Data length specifies the number of bytes, followed by the action data.

For example, to serialize a struct like this:
struct S {
  // x, y, u, v are packed into field 0.
  // packed subfields must be naturally aligned to
  // their respective size.
  0: int32 x;
     int16 y;
     uint8 u, v;

  1: list<string> xs;
};
Then we would first dump field 0 (which has x, y, u, v all packed together) and field 1 of S, even though field 1's value would be meaningless. However, for each string in xs, we'll output a fix-up that says "add this data as a string to field 1." The decoder of fix-up instructions can reuse the same read buffer to construct string pointer references, so the string content need not be copied.

When this wire format is decoded, the decoder allocates memory for S and dumps all the field values there. It would consult the definition of S and realizes that it has to initialize field 1 by allocating a list object to hold the strings. Then it would process through the fix-ups and add the strings as required.

This will work for nested structs as well. What I haven't quite figured out is dictionary. The encoding would not be very efficient if we encode the fix-up for the key, then the fix-up for the value. In fact, since all the keys are of the same type, and all the values are of the other type, it is more efficient to encode them as two parallel lists.

With this encoding scheme, I don't need to deal with that inter-segment pointer non-sense.

Earlier on I wrote about an improved binary JSON serialization format. I think there is still a use case for that. Basically if a programming language does not use raw struct layout in memory, then it won't be able to take advantage of the Cap'n Proto styled encoding, and having a compact wire format is probably better. Only C/C++ can really take advantage of Cap'n Proto styled encoding. Also, this raw encoding requires a schema in order to know how to interpret data. With binary JSON, the data is self-tagging and self-delimiting. You can manipulate data even if you don't have their schema.

No comments: