Thursday, March 17, 2016

Protobuf like wire encoding inspired by MsgPack

I had previously mused about binary JSON encoding and contemplated unifying it with flat memory structures inspired by Cap'n Proto. It turns out that the most interesting uses of proto structures are the optional and repeated fields as well as other variable length data, so much that proto3 has eliminated required fields but added support for maps. Flat memory layout does not present variable length data well, and even so, some encoding and decoding are still necessary.

Hence I've decided to take another look, this time applying some ideas of MsgPack to encode protobuf wire data. The traditional protobuf encoding allows 8 distinct type tags (4 used, 2 deprecated, 2 more are reserved). The used ones are varint (0), 64-bit scalar (1), length-delimited (2), and 32-bit scalar (5).

Each field is packed as (field_num << 3) | type_tag as a varint key followed by whatever the value is indicated by the type tag. The largest field number that can be represented in a single byte varint is 15 (largest single digit base128 is 127, divided by 8 for 3 bits used by the type tag). In practice, moderate sized proto messages would need 2 bytes for the field keys.

Again, drawing from the ideas behind MsgPack, a single-byte can be appropriated for a range of small integers and moderately sized type tags that can encode short lengths. Decoupling the field number and type tag makes it convenient to express maps very efficiently, but many messages can still be encoded in the same number of bytes or fewer.

To rehash, here are the ranges for small integers.

  • 0x00 .. 0x7F: zero and positive integers 0 through 127.
  • 0xC0 .. 0xFF: negative integers -64 through -1.
We allocate the range 0x80 .. 0xBF as follows:
  • 0x80 .. 0xAF: variable bytes with lengths 0 through 47.
    • followed by this many number of bytes for the actual data.
  • 0xB0 .. 0xB4: single scalar of bit sizes 8, 16, 32, 64, and 128 (little endian).
    • these can be interpreted as signed or unsigned integers or floating point numbers.
  • 0xB5 .. 0xB7: reserved.
  • 0xB8: nil or null
  • 0xB9: varint
  • 0xBA: variable bytes
    • followed by a varint for the length.
    • followed by this many number of bytes for the actual data.
  • 0xBB .. 0xBF: reserved.
There are still plenty of reserved type tags for future use, though I've left out some important ones such as start and end of a list or a map, but opted to wrap them inside variable length data for lazy decoding. This means unlike MsgPack where the schema can be discerned from the encoding, the encoding here requires a separate schema in order to be fully interpreted.

A top level message is encoded like a map with the sequence \( k_1, v_1, k_2, v_2, ... \) but all the keys must represent integers. A map, on the other hand, can have non-integer key types. A list is simply a sequence of values \( v_1, v_2, ... \). Nested messages, maps, and lists are encapsulated inside variable length data like strings.

All variable length data store the bytes contiguously so the memory region can be referenced directly, for zero-copy decoding. I've decided to forego chunked encoding for now, although chunks can also be referenced in memory as a "cord" as opposed to a string.

Note that it is also possible to implement zero-copy and lazy decoding for the original protobuf wire encoding as well, but I'm not aware of any implementation that does that.

Here is an alternative encoding that restores the ability to discern schema from encoding but preserve the ability to do lazy decoding.
  • 0x80 .. 0x8E, 0x8F: byte string with lengths 0 through 14, and arbitrary length.
  • 0x90 .. 0x9E, 0x9F: lists with encoded byte lengths 0 through 14, and arbitrary length.
  • 0xA0 .. 0xAE, 0xAF: maps or nested messages, same deal.
  • 0xB0 .. 0xBF: like the encoding before.
The idea is that we represent byte strings, lists, and maps / nested messages in separate ranges. The lengths 0 through 14 are encoded directly in the type tag, and the 15 means the length follows the tag as a varint.

The problem still remains that we can't do one-pass zero-copy encoding of variable length data and come back to fix the length, since we might not have reserved enough space for the length.

No comments: