Saturday, March 30, 2013

Improved binary JSON encoding

I like MsgPack because its semantics correspond well with JSON, and the encoding is extremely compact. One reason of the compactness is that it encodes small positive and negative integers as is, and arrays, maps, and strings of small sizes have dedicated tags.

While the "fixed" representation works well, MsgPack lacks the representation to stream large amounts of data. Streaming requires the encoding to be chunked, which MsgPack doesn't support. Chunking of string is done through string groups where string pieces within the string group markers are considered a single string. Similar group markers also exist for array and map.

Here is a MsgPack inspired wire format that allows chunking. The following list describes the encoding of an object.
  • Tags 0x00 through 0x7F : positive fixnum 0 through 127.
  • Tags 0x80 through 0x9F : short string or raw bytes of lengths 0 through 31, not chunked.
    • Followed by n bytes of the string data.
  • Tags 0xA0 through 0xA9 reassigned (edited: 11/13/2013)
    • Tags 0xA0 through 0xA7 : fixed array of lengths 0 through 7 respectively.
      • Followed by n objects.
    • Tag 0xA8 : end of string data. This is optional and implied by any tag that is not 0xA9.
    • Tag 0xA9 : start or continuation of a string or raw bytes chunk.
      • Followed by chunk length encoded as a varint. (edited: 9/27/2013)
      • Followed by an object parseable as an integer, interpreted as chunk length n.
      • Followed by n bytes of string data.
  • Tags 0xA0 through 0xA5: (reserved)
  • Tag 0xA6: big string. (new: 11/13/2013)
    • Followed by an object parseable as an integer for the string length.
    • Followed by n bytes of string data.
  • Tag 0xA7: packed numeric array. (new: 11/13/2013)
    • Followed by an object parseable as an integer n, for the total number of bytes of the data.
    • Followed by an object parseable as an integer k, for the type specifier.
      • Big endian integers:
        • 0: uint8, 1: uint16, 2: uint32, 3: uint64. (unsigned)
        • 4: int8, 5: int16, 6: int32, 7: int64. (signed)
      • Little endian integers:
        • 8: uint8, 9: uint16, 10: uint32, 11: uint64. (unsigned)
        • 12: int8, 13: int16, 14: int32, 15: int64. (signed)
      • Big endian IEEE 754 floats.
        • 16: half, 17: single, 18: double, 19: quad
      • Little endian IEEE 754 floats.
        • 20: half, 21: single, 22: double, 23: quad
      • TBD: unicode? pixel format (packed numeric tuples)?
    • Note that the number of items can be obtained by dividing n by the size of each item implied by k.
    • Followed by a short string object of lengths 0 through 7 (actual object length is 1 through 8). This string is only used for padding and is ignored.
    • Followed by n bytes of actual data.
  • Tag 0xA8: string group begin. (new: 11/13/2013)
    • The strings that follow until "string group end" will be concatenated into a single string.
  • Tag 0xA9: string group end. (new: 11/13/2013)
  • Tag 0xAA: array group begin.
    • The objects that follow until "array group end" are items of the array.
  • Tag 0xAB: array group end.
  • Tag 0xAC: map group begin.
    • The objects that follow until "map group end" are grouped into key-value pairs of the map.
  • Tag 0xAD: map group end.
  • Tags 0xAE and 0xAF : (reserved) (edited: 9/28/2013).
  • Tags 0xAE and 0xAF: big-endian and little-endian struct with edit map. See Unified Binary JSON and Cap'n Proto Wire Format.
  • Tag 0xB0 : null.
  • Tag 0xB1 : abstract data type (new: 9/27/2013).
    • Followed by an object representing the constructor name which is typically a string.
    • Followed by an object representing the value, which can be any parseable object as the argument to the constructor.
  • Tag 0xB2 : boolean false.
  • Tag 0xB3 : boolean true.
  • Tag 0xB4 : unsigned int32.
    • Followed by a big-endian 32-bit unsigned integer.
  • Tag 0xB5 : signed int32.
    • Followed by a big-endian 32-bit signed integer.
  • Tag 0xB6 : unsigned int64.
    • Followed by a big-endian 64-bit unsigned integer.
  • Tag 0xB7 : signed int64.
    • Followed by a big-endian 64-bit signed integer.
  • Tags 0xB8 through 0xBB: (reserved)
  • Tag 0xBC : float.
    • Followed by IEEE 754 32-bit single precision float in big-endian.
  • Tag 0xBD : double.
    • Followed by IEEE 754 64-bit double precision float in big-endian
  • Tag 0xBE : varint base128.
    • Followed by a varint.
  • Tag 0xBF : signed varint base128 (zigzag).
    • Followed by a signed zigzag varint.
  • Tags 0xC0 through 0xFF: negative fixnum -64 through -1.
There are fewer tags needed than MsgPack, so we can extend negative fixnum to -64.

Strings are 8-bit clean with an unspecified encoding. Most applications should use UTF-8.

Both array and map allow nesting. As long as the begin and end markers are properly nested, the parser does not care how many items there are.

Integers larger than 128 are better encoded using varint, unless it is determined that 32-bit or 64-bit encoding is more efficient.

If a map is used to serialize a protocol buffer message with non-packed repeatable fields, the key-value pairs are simply repeated with multiple occurrences of the values, without using an array. In language bindings that do not recognize repeated key-value pairs in a map, the last key-value will overwrite previous occurrences.

No comments: