Thursday, March 17, 2016

Protobuf like wire encoding inspired by MsgPack

I had previously mused about binary JSON encoding and contemplated unifying it with flat memory structures inspired by Cap'n Proto. It turns out that the most interesting uses of proto structures are the optional and repeated fields as well as other variable length data, so much that proto3 has eliminated required fields but added support for maps. Flat memory layout does not present variable length data well, and even so, some encoding and decoding are still necessary.

Hence I've decided to take another look, this time applying some ideas of MsgPack to encode protobuf wire data. The traditional protobuf encoding allows 8 distinct type tags (4 used, 2 deprecated, 2 more are reserved). The used ones are varint (0), 64-bit scalar (1), length-delimited (2), and 32-bit scalar (5).

Each field is packed as (field_num << 3) | type_tag as a varint key followed by whatever the value is indicated by the type tag. The largest field number that can be represented in a single byte varint is 15 (largest single digit base128 is 127, divided by 8 for 3 bits used by the type tag). In practice, moderate sized proto messages would need 2 bytes for the field keys.

Again, drawing from the ideas behind MsgPack, a single-byte can be appropriated for a range of small integers and moderately sized type tags that can encode short lengths. Decoupling the field number and type tag makes it convenient to express maps very efficiently, but many messages can still be encoded in the same number of bytes or fewer.

To rehash, here are the ranges for small integers.

  • 0x00 .. 0x7F: zero and positive integers 0 through 127.
  • 0xC0 .. 0xFF: negative integers -64 through -1.
We allocate the range 0x80 .. 0xBF as follows:
  • 0x80 .. 0xAF: variable bytes with lengths 0 through 47.
    • followed by this many number of bytes for the actual data.
  • 0xB0 .. 0xB4: single scalar of bit sizes 8, 16, 32, 64, and 128 (little endian).
    • these can be interpreted as signed or unsigned integers or floating point numbers.
  • 0xB5 .. 0xB7: reserved.
  • 0xB8: nil or null
  • 0xB9: varint
  • 0xBA: variable bytes
    • followed by a varint for the length.
    • followed by this many number of bytes for the actual data.
  • 0xBB .. 0xBF: reserved.
There are still plenty of reserved type tags for future use, though I've left out some important ones such as start and end of a list or a map, but opted to wrap them inside variable length data for lazy decoding. This means unlike MsgPack where the schema can be discerned from the encoding, the encoding here requires a separate schema in order to be fully interpreted.

A top level message is encoded like a map with the sequence \( k_1, v_1, k_2, v_2, ... \) but all the keys must represent integers. A map, on the other hand, can have non-integer key types. A list is simply a sequence of values \( v_1, v_2, ... \). Nested messages, maps, and lists are encapsulated inside variable length data like strings.

All variable length data store the bytes contiguously so the memory region can be referenced directly, for zero-copy decoding. I've decided to forego chunked encoding for now, although chunks can also be referenced in memory as a "cord" as opposed to a string.

Note that it is also possible to implement zero-copy and lazy decoding for the original protobuf wire encoding as well, but I'm not aware of any implementation that does that.

Here is an alternative encoding that restores the ability to discern schema from encoding but preserve the ability to do lazy decoding.
  • 0x80 .. 0x8E, 0x8F: byte string with lengths 0 through 14, and arbitrary length.
  • 0x90 .. 0x9E, 0x9F: lists with encoded byte lengths 0 through 14, and arbitrary length.
  • 0xA0 .. 0xAE, 0xAF: maps or nested messages, same deal.
  • 0xB0 .. 0xBF: like the encoding before.
The idea is that we represent byte strings, lists, and maps / nested messages in separate ranges. The lengths 0 through 14 are encoded directly in the type tag, and the 15 means the length follows the tag as a varint.

The problem still remains that we can't do one-pass zero-copy encoding of variable length data and come back to fix the length, since we might not have reserved enough space for the length.

Saturday, March 12, 2016

How to Play Go Like a Computer Scientist

The board game of Go, or 圍棋 (weiqi), started receiving a lot of attention when AlphaGo beat the European champion Fan Hui in all of the five games. Now people are worried when AlphaGo is on its way to beat the world champion Lee Sedol, winning 3 out of 5 games so far. AlphaGo plays the game like no human as seen the game played before, inciting the awe of sadness and beauty. It is touted as an accomplishment of machine intelligence because traditionally the game of Go is very difficult for a machine to play well, whereas games like Chess is well-understood.

Board games are fascinating because a few number of rules allow an exponential number of possible game moves, some with winning outcomes, and some with losing. Traditionally, a naive game solver for any game tries all the possible positions of each alternate side's game play some number of steps ahead and chooses the first move that leads to the best gameplay after that number of steps, known as the min-max algorithm. Pruning heuristics is applied to reduce the search space so the computer does not waste time searching down "lost cause" paths, known as alpha-beta pruning. Even with pruning, the search space is still exponential to the number of positions on the board.

The keyword here is exponential. The machine can spend more time to compute, but for the amount of time spent \( t \), it can only search by \( \log{t} \) more steps. We can attempt to parallelize exponential search by enlisting the help of some number of machines, but it will still only allow us to advance the search depth by \( \log{n} \), i.e. the added intelligence is logarithmic to the number of machines.

Chess is well-understood. The number of game play combinations are vast, but we have found ways to prune the search space so that it is tractable with computer clusters that are practical to build. Computers can beat human playing Go on a smaller \( 9 \times 9 \) board with an upper bound of \( (9 \times 9)! \) or \( 10^{121} \) possible game plays, but on a professional championship level, the board size \( 19 \times 19 \) sports an exponential explosion of an upper bound of \( (19 \times 19)! \) or approximately \( 10^{768} \) possible game plays.

Even though Google might undertake the most ambitious of data center builds, they couldn't possibly shut down all the kitten videos on YouTube so AlphaGo could play against Lee Sedol with super-intelligence. Even so, they would achieve only logarithmic increase in intelligence. Building machines to brute force an exponential algorithm is a sure way to go bankrupt (NSA is an exception because the government prints the money).

I feel that a disclaimer is in order before I write any further. I don't have insider knowledge about how AlphaGo works. Everything I'm about to speculate comes from common sense and public sources.

AlphaGo algorithm reduces the search space using Monte-Carlo tree search, which means it randomly prunes some moves to reduce the branching factor, which allows the search to go deeper. The pruning is guided by deep neural network so it decreases the probability of pruning winning moves. Deep neural network works by building a hierarchy of signal processing nodes that amplifies certain signals after training.

When using DNN to recognize pictures, we can visualize how it works by looking at the images synthesized iteratively in order to boost certain signals.

In computer science, a pervasive strategy to design a super-efficient algorithm is by divide and conquer: basically, a big problem is solved by breaking down the problem into smaller sub-problems, until the problem becomes trivially solvable. Here are some examples:
  • Sorting is taught to every first year computer science undergraduate. Given \( n \) items to sort, a naive sorting algorithm like bubble sort will take \( O(n^2) \) steps, whereas better ones using divide and conquer like quick sort or merge sort will take \( O(n \lg{n}) \) steps. If we are to sort 1000 items, this makes the difference between 1 million steps verses just 10000 steps.
  • First year graduate student would likely learn about discrete Fourier transform. A naive DFT takes \( O(n^2) \) steps, whereas fast Fourier transform using divide and conquer takes \( O(n \lg{n}) \) steps.
  • A naive matrix multiplication takes \( O(n^3) \) steps for an \( n \times n \) matrix, but Strassen's algorithm using divide and conquer takes \( O(n^{\lg{7}}) \) or approximately \( O(n^{2.8}) \) steps.
  • The n-th Fibonacci number can be computed in \( O(\lg{n}) \) time using a divide and conquer method. The straightforward iterative method would take \( O(n) \) time, and a naive recursive method would take \( O(2^n) \) time. See Fibonacci Number Program for some examples in code.
For other algorithms, divide and conquer may not reduce the number of steps, but it still improves computation performance due to better cache locality. Divide and conquer is the basic idea of cache oblivious algorithms.

In other words, the real innovation of AlphaGo is not because machines achieved singularity, but because deep neural network approximated the divide and conquer method which happens to allow Go to be played efficiently.

The outcome of Go is scored by determining whether the pieces can survive locally, and the territorial occupation by the surviving pieces. This makes the winning strategy two-fold: on one hand, a player needs to ensure local survival of the pieces, but he has to take the territory into account. If a local battle is losing, it could be better for a player to strengthen his position on a different part of the board. This division of local and global thinking lends itself very well to using the divide and conquer method.

Here are some ideas how to divide. On the first level, we divide the board into subregions of \( 10 \times 10 \) each (figure left). We can also add overlapping subregions of \( 9 \times 9 \) or \( 9 \times 10 \) (figure right). This results in the first level division into \( 3 \times 3 \) subregions.

Left: Level 1
Right: Level 1 overlaps
Each subregion can be divided further into \( 5 \times 5 \) or \( 5 \times 4 \) grids with overlaps. In the following figure, only the division of the first subregion is shown, but it should be understood that the same division will take place for all first level subregions.

Left: Level 2
Right: Level 2 overlaps
A sketch of the divide and conquer algorithm would play Go by first choosing the region to play by valuation of the amount of territory it can likely occupy. Then within each region, it could brute force the positions that maximizes the survival of its own pieces.

Here is a very rough guesstimate of the number of steps needed to solve the game. These moves can be grouped into streaks of local battles taking place in \( 3 \times 3 \) subregions, and each streak is further divided to settle the \( 3 \times 3 \) sub-subregions which are about \( 5 \times 5 \) by grid. This reduces the amount of search space to be just:
\[ (5 \times 5)! \cdot (3 \times 3)! \cdot (3 \times 3)! \approx 10^{36} \]
This is a lot more computationally tractable. Of course, the caveat of my calculations is that I have never implemented a divide and conquer player for Go, and I don't know how well it works in practice. But even if we have to add a few levels of \( 3 \times 3 \) division, multiplying the search space by \( 9! \) still grows a lot slower than the exponential explosion of \( (9 \times 9)! \).
\[ (5 \times 5)! \cdot (9!)^9 \approx 10^{75} \]
Using this divide and conquer method, we can envision a game of Go played on a much bigger board, maybe like the 3-dimensional chess depicted in Star Trek.

The difference between the world champion Go player such as Lee Sedol and a modest Go player who happens to be a computer scientist, is that a computer scientist understands the power of divide and conquer. There is an immense beauty of divide and conquer that reaches far beyond game play. Nobody is worried when a computer doing quick sort can sort numbers faster than a human can with no sorting strategy, so why should we worry that a computer using divide and conquer can play a board game better than humans with only an ad-hoc strategy?

No, the computer is not about to achieve singularity. It is just a tool that helped us understand the game better.