Saturday, March 30, 2013

Improved binary JSON encoding

I like MsgPack because its semantics correspond well with JSON, and the encoding is extremely compact. One reason of the compactness is that it encodes small positive and negative integers as is, and arrays, maps, and strings of small sizes have dedicated tags.

While the "fixed" representation works well, MsgPack lacks the representation to stream large amounts of data. Streaming requires the encoding to be chunked, which MsgPack doesn't support. Chunking of string is done through string groups where string pieces within the string group markers are considered a single string. Similar group markers also exist for array and map.

Here is a MsgPack inspired wire format that allows chunking. The following list describes the encoding of an object.
  • Tags 0x00 through 0x7F : positive fixnum 0 through 127.
  • Tags 0x80 through 0x9F : short string or raw bytes of lengths 0 through 31, not chunked.
    • Followed by n bytes of the string data.
  • Tags 0xA0 through 0xA9 reassigned (edited: 11/13/2013)
    • Tags 0xA0 through 0xA7 : fixed array of lengths 0 through 7 respectively.
      • Followed by n objects.
    • Tag 0xA8 : end of string data. This is optional and implied by any tag that is not 0xA9.
    • Tag 0xA9 : start or continuation of a string or raw bytes chunk.
      • Followed by chunk length encoded as a varint. (edited: 9/27/2013)
      • Followed by an object parseable as an integer, interpreted as chunk length n.
      • Followed by n bytes of string data.
  • Tags 0xA0 through 0xA5: (reserved)
  • Tag 0xA6: big string. (new: 11/13/2013)
    • Followed by an object parseable as an integer for the string length.
    • Followed by n bytes of string data.
  • Tag 0xA7: packed numeric array. (new: 11/13/2013)
    • Followed by an object parseable as an integer n, for the total number of bytes of the data.
    • Followed by an object parseable as an integer k, for the type specifier.
      • Big endian integers:
        • 0: uint8, 1: uint16, 2: uint32, 3: uint64. (unsigned)
        • 4: int8, 5: int16, 6: int32, 7: int64. (signed)
      • Little endian integers:
        • 8: uint8, 9: uint16, 10: uint32, 11: uint64. (unsigned)
        • 12: int8, 13: int16, 14: int32, 15: int64. (signed)
      • Big endian IEEE 754 floats.
        • 16: half, 17: single, 18: double, 19: quad
      • Little endian IEEE 754 floats.
        • 20: half, 21: single, 22: double, 23: quad
      • TBD: unicode? pixel format (packed numeric tuples)?
    • Note that the number of items can be obtained by dividing n by the size of each item implied by k.
    • Followed by a short string object of lengths 0 through 7 (actual object length is 1 through 8). This string is only used for padding and is ignored.
    • Followed by n bytes of actual data.
  • Tag 0xA8: string group begin. (new: 11/13/2013)
    • The strings that follow until "string group end" will be concatenated into a single string.
  • Tag 0xA9: string group end. (new: 11/13/2013)
  • Tag 0xAA: array group begin.
    • The objects that follow until "array group end" are items of the array.
  • Tag 0xAB: array group end.
  • Tag 0xAC: map group begin.
    • The objects that follow until "map group end" are grouped into key-value pairs of the map.
  • Tag 0xAD: map group end.
  • Tags 0xAE and 0xAF : (reserved) (edited: 9/28/2013).
  • Tags 0xAE and 0xAF: big-endian and little-endian struct with edit map. See Unified Binary JSON and Cap'n Proto Wire Format.
  • Tag 0xB0 : null.
  • Tag 0xB1 : abstract data type (new: 9/27/2013).
    • Followed by an object representing the constructor name which is typically a string.
    • Followed by an object representing the value, which can be any parseable object as the argument to the constructor.
  • Tag 0xB2 : boolean false.
  • Tag 0xB3 : boolean true.
  • Tag 0xB4 : unsigned int32.
    • Followed by a big-endian 32-bit unsigned integer.
  • Tag 0xB5 : signed int32.
    • Followed by a big-endian 32-bit signed integer.
  • Tag 0xB6 : unsigned int64.
    • Followed by a big-endian 64-bit unsigned integer.
  • Tag 0xB7 : signed int64.
    • Followed by a big-endian 64-bit signed integer.
  • Tags 0xB8 through 0xBB: (reserved)
  • Tag 0xBC : float.
    • Followed by IEEE 754 32-bit single precision float in big-endian.
  • Tag 0xBD : double.
    • Followed by IEEE 754 64-bit double precision float in big-endian
  • Tag 0xBE : varint base128.
    • Followed by a varint.
  • Tag 0xBF : signed varint base128 (zigzag).
    • Followed by a signed zigzag varint.
  • Tags 0xC0 through 0xFF: negative fixnum -64 through -1.
There are fewer tags needed than MsgPack, so we can extend negative fixnum to -64.

Strings are 8-bit clean with an unspecified encoding. Most applications should use UTF-8.

Both array and map allow nesting. As long as the begin and end markers are properly nested, the parser does not care how many items there are.

Integers larger than 128 are better encoded using varint, unless it is determined that 32-bit or 64-bit encoding is more efficient.

If a map is used to serialize a protocol buffer message with non-packed repeatable fields, the key-value pairs are simply repeated with multiple occurrences of the values, without using an array. In language bindings that do not recognize repeated key-value pairs in a map, the last key-value will overwrite previous occurrences.

Sharing LLVM and cilkplus LLVM in a single git clone

I want to track the development of both LLVM and cilkplus LLVM in git, but I don't want to have two git clones with essentially the same content. So I took advantage of the ability that git could pull from multiple remote repositories.

First I pull LLVM like this:
git clone
git config branch.master.rebase true
And then heeding the advice to pull-push from multiple remote locations, I add cilkplus LLVM like this:
git remote add cilkplus git://
git remote update
Now I have two remote branches, cilkplus/cilkplus and cilkplus/master as well as branches from origin.
$ git branch -r
  origin/HEAD -> origin/master
I could check them out directly either in the detached head state, or I could create a local branch to track it.
git branch --track cilkplus cilkplus/cilkplus
git checkout cilkplus
git pull
Now I end up with the list of branches like this:
$ git branch -a
* cilkplus
  remotes/origin/HEAD -> origin/master
Repeat this for clang and compiler-rt.

Saturday, March 2, 2013

Machine learning and its impact on human learning

Machine learning is a technique for finding model within a large dataset. It is used by IBM to create Watson that can answer questions in Jeopardy using knowledge it harvested from the web. It is used by Google to translate texts between one natural language to another, by finding traits between already translated corpus. It is used by financial firms to analyze stock market or other markets by looking at historical data, as well as to detect fraud. It is used by insurance companies to calculate risks. It is also used in scientific research, such as analyzing data from Large Hadron Collider. Machine learning is enabled by the massive computing power that big companies are able to build from the large quantities of computers that modern manufacturing process is able to churn out at incredible speed.

Machine learning, in short, is outsourcing research to machines. Those are research whose results have practical uses, but the process of data analysis to find pattern and formulate a model to explain the phenomenon is mind-numbingly tedious. However, machines can't figure out the research methodology itself. Humans preprogram the machine with a generic model, and machine learning finds a substructure within that generic model that can be used to predict phenomenon. If the generic model doesn't predict well, it is generalized even further. This becomes a trial and error process on the part of human.

The reason a generic model might not work well is because every model has limits in its expressive power. For example, Linear Algebra (in which variables relate to one another as a simple sum of coefficients multiplied by first-degree variable occurrences, e.g. \(x = 2y + 3z\)) cannot fit data that are quadratic in nature (e.g. \(x = y^2 + z^2\)), and quadratic model cannot fit data that are cubic in nature (e.g. \(x = z^3\)). A generic polynomial model cannot fit data that are exponential or transcendental in nature, although Taylor Series predict that a polynomial model can be used to approximately model and interpolate data but cannot extrapolate them beyond a certain boundary.

The more general the model is, a disproportionally amount of computing power increase is required to find submodel, to a point where the computation is intractable (e.g. integer linear programming is NP-complete) or just unsolvable.

Also, even if data is known to fit in a simple computable model, finding the right sub-model requires a stroke of luck. For example, simulated annealing is a machine learning technique that starts off the model searching with random parameters in the model and iteratively converge these parameters to find ones that fit data. Two annealing attempts can yield drastically different parameters that both fit existing data but disagree in predicting future phenomenon.

Some of these models can have thousands or millions of parameters. After a model is found, it is extremely hard to explain why the data fits the model. The effectiveness of the model is evaluated by having two corpuses, one for training and one for verification, where each corpus is a set of data that associates an input to an expected output. The training corpus is used to build a model, and the verification corpus is used to evaluate the model based on inputs that the model has never seen before.

There is very little understanding of how the model works, and less on how to explain the phenomenon. For example, a stock market model can be used to predict stock trends, but the model does not tell us there are various actors in the market—sellers and buyers, short-term and long-term investors, actos with large capital and actos with small capital—and how their behaviors influence the market. Language translation using machine learning does not help us understand why natural languages evolved the way it did.

Machine learning is like using proprietary technology. It's a service that we don't care how it works. The problem is solved but we don't know how. Since no insight is gained, we cannot build on top of this understanding to change the world. For example, stock market model does not help shaping financial regulation because it does not explain the behavior of the people controlling the capital. Machine language translation model does not help us improve a language so that it helps people express ideas and communicate better.

In a way, machine learning takes thoughtful problem solving out of the research. We become complacent in thinking. When there is a hard problem, throw more computing iron behind it, and let the computer find an answer for us. We should take inspiration from the story of a crow drinking water from a pitcher, by throwing pebbles into the pitcher in order to raise the water level enough so that he can drink the water. The moral of the story is the virtue of thoughtfulness over brute strength.

Let us not forget that thoughtful problem solving is part of what makes us human, and not lose that quality because of computers.