Saturday, September 28, 2013

Unified Binary JSON and Cap'n Proto Wire Format

Earlier I wrote about a binary JSON wire format that is inspired by MsgPack but allows string chunking and a compact representation of arrays and maps. Then I wrote about Cap'n Proto, which transmits the fields of a struct as they are laid out in memory, so it needs no encoding or decoding. But I lamented that the wire format gets complicated because you actually have to do some memory management in order to cope with strings and repeated fields. Then it occurred to me: why not take the best of both worlds?

The idea is that the memory layout of a struct is transmitted first, followed by a series of editing instructions for populating non-integer fields such as arrays, strings, and other objects. Since the memory layout may need to be aligned, we need to allow padding so that we can access structs directly, e.g. through memory-mapped I/O.

For the sake of simplicity, let's consider a struct that contains only int64 fields and object pointers. Smaller integers are packed as bit fields, e.g. two int32 values per int64. Object pointers are only used for strings, arrays, or other composite objects. On a 32-bit machine, only the immediate half of the int64 field is used for the object pointer.
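
As a rough illustration of the packing described above, here is a minimal Python sketch that stores two int32 values in one 8-byte slot. The function names are made up for this example, and which int32 occupies which half of the slot is an assumption; the post does not specify the order.

```python
import struct

def pack_two_int32(a, b, little_endian=False):
    """Pack two signed 32-bit integers into a single 8-byte slot.

    Assumption: `a` is written first and `b` second in stream order.
    """
    fmt = "<ii" if little_endian else ">ii"
    return struct.pack(fmt, a, b)

def unpack_two_int32(slot, little_endian=False):
    """Recover the two signed 32-bit integers from an 8-byte slot."""
    fmt = "<ii" if little_endian else ">ii"
    return struct.unpack(fmt, slot)

# One int64-sized slot holding the pair (1, -2) in big-endian byte order.
slot = pack_two_int32(1, -2)
```

The slot is always exactly 8 bytes, so a struct made of such slots keeps every field on an 8-byte boundary.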

In the previous proposal, tags 0xAE and 0xAF are reserved, so we can use them to encode a struct with its in-memory layout. The logical wire stream would look like this:

  • Tag 0xAE: struct with big-endian in-memory layout.
    • Followed by an integer object n for the number of int64 fields in the struct.
    • Followed by a short string object of length 0 through 7 (the actual object length, including its header, is 1 through 8). This string is only used for padding, so that the field data that follows is aligned, and its content is ignored.
    • Followed by n * 8 bytes of struct field data in big endian.
    • Followed by the edit map object, whose keys are field numbers and whose values are the objects to be stored in those fields. Repeated keys may be used to populate a repeated field, in the same order the values occur.
  • Tag 0xAF: like 0xAE, except struct field data are little-endian.
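
The framing above can be sketched in Python as follows. This is only a sketch under stated assumptions: the byte encodings of the integer object, the padding string, and the edit map come from the earlier binary JSON proposal and are not reproduced here, so `count_obj` and `edit_map_obj` are passed in already encoded and `encode_padding` is a caller-supplied stand-in. It also assumes the padding object has a 1-byte header, which is how I read "length 0 through 7, object length 1 through 8".

```python
import struct

TAG_BIG_ENDIAN = 0xAE
TAG_LITTLE_ENDIAN = 0xAF

def padding_object_length(offset):
    """Total padding-object length (1 through 8) that makes the byte
    following the padding land on an 8-byte boundary."""
    return 8 - (offset % 8)

def frame_struct(fields, count_obj, edit_map_obj, encode_padding,
                 start_offset=0, little_endian=False):
    """Lay out tag, field count, padding, field data, and edit map.

    count_obj and edit_map_obj are already-encoded bytes (hypothetical here).
    encode_padding(payload_len) must return payload_len + 1 bytes.
    """
    tag = TAG_LITTLE_ENDIAN if little_endian else TAG_BIG_ENDIAN
    head = bytes([tag]) + count_obj
    pad_total = padding_object_length(start_offset + len(head))
    pad = encode_padding(pad_total - 1)
    if len(pad) != pad_total:
        raise ValueError("padding encoder returned the wrong length")
    order = "<" if little_endian else ">"
    body = struct.pack("%s%dq" % (order, len(fields)), *fields)
    return head + pad + body + edit_map_obj

def fake_padding(payload_len):
    # Placeholder header byte; the real short-string tag is defined elsewhere.
    return b"\x00" + b"\x00" * payload_len

# Two int64 fields, a made-up 1-byte count object, and an empty edit map.
frame = frame_struct([1, 2], count_obj=b"\x02", edit_map_obj=b"",
                     encode_padding=fake_padding)
```

With a 2-byte head, the padding object takes 6 bytes, so the field data starts at offset 8 of the frame. Because the padding object is at least 1 byte long, a head that already ends on a boundary still costs a full 8 bytes of padding.
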

If the reader is given a schema a priori, it can perform the field editing on the fly according to the schema and reject bad values in the edit map. If there is no schema, the reader might save the edit map and postpone the editing until the struct is bound to a schema. The reader may also attempt schema-less on-the-fly editing by keeping track of value type conflicts for a given field number. Without a schema, fields that do not appear in the edit map are assumed to be int64 unless overridden.
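
Schema-driven editing might look like the Python sketch below. The representation is hypothetical: the edit map is modeled as a list of `(field_number, value)` pairs so that repeated keys keep their order, and the schema is a dict mapping a field number to `(python_type, is_repeated)`. Treating any field absent from the schema as a plain int64 that must not be edited is also an assumption of this sketch.

```python
def apply_edit_map(slots, edits, schema):
    """Return a copy of `slots` with the edits applied according to `schema`.

    slots  -- list of raw int64 field values
    edits  -- list of (field_number, value) pairs, in wire order
    schema -- dict: field_number -> (expected_type, is_repeated)
    """
    fields = list(slots)
    started = set()  # repeated fields that have already received a value
    for field_no, value in edits:
        if not 0 <= field_no < len(fields):
            raise ValueError("field %d is out of range" % field_no)
        if field_no not in schema:
            # Assumption: fields not in the schema are plain int64 slots.
            raise ValueError("field %d is not an object field" % field_no)
        expected_type, is_repeated = schema[field_no]
        if not isinstance(value, expected_type):
            raise ValueError("bad value type for field %d" % field_no)
        if is_repeated:
            if field_no not in started:
                fields[field_no] = []  # replace the raw slot with a list
                started.add(field_no)
            fields[field_no].append(value)
        else:
            fields[field_no] = value
    return fields

schema = {1: (str, False), 2: (str, True)}
edits = [(1, "hi"), (2, "a"), (2, "b")]
result = apply_edit_map([7, 0, 0], edits, schema)
```

Here field 0 stays an integer, field 1 becomes a string, and field 2 collects both values in the order they occurred. A reader without a schema could store `edits` as is and call a function like this later, once a schema is bound.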

As far as the writer is concerned, a struct object can be written either as big-endian or little-endian, and it can optionally be written out simply as a map (as opposed to 0xAE or 0xAF tagged objects) to maximize schema-less compatibility and/or efficiency in encoding.