Tuesday, December 13, 2016

Fermat Numbers and Bit Mask

There is a little known application of Fermat numbers for generating bit masks, that I discovered back in 2010 while implementing "counting the number of one bits" using C++ template meta-programming. In the algorithm, bit masks such as 0x55555555, 0x33333333, 0x0f0f0f0f, etc. are used to mask out bits in specific locations. These bit masks can be generated by dividing the maximum integer 0xffffffff (or simply ~0u) against Fermat numbers 3, 5, 17, 257, 65537, etc.

0x5555...0101010101010101...~0u / 3
0x3333...0011001100110011...~0u / 5
0x0f0f...0000111100001111...~0u / 17
0x00ff...0000000011111111...~0u / 257
and so on...

It's not a coincidence that integers with all one bits set can be divided cleanly by a Fermat number, and this works with 8-bit, 16-bit, 32-bit, 64-bit, and beyond. To understand why, a \(k\)-bit all-one integer is \(2^k - 1\). In particular, \(k = 2^n\) in our case. We can factor \(2^{2^n} - 1\) using the formula \(a^2 - b^2 = (a + b)(a - b)\) we learned in grade school, here using the special case \(b = 1\), so \(a^2 - 1 = (a + 1)(a - 1)\).
2^{2^n} - 1 & = (2^{2^{n - 1}} + 1)(2^{2^{n - 1}} - 1) \\
 & = (2^{2^{n - 1}} + 1)(2^{2^{n - 2}} + 1)(2^{2^{n - 2}} - 1) \\
 & = (2^{2^{n - 1}} + 1)(2^{2^{n - 2}} + 1)(2^{2^{n - 3}} + 1)(2^{2^{n - 3}} - 1) \\
 & = \dots
Conveniently, the last term \(2^{2^0} - 1\) is just \(1\), so we can write the product in terms of \(F_i = 2^{2^i} + 1\) for the \(i\)-th Fermat number,
\[ 2^{2^n} - 1 = \prod_{i=0\dots n-1}{2^{2^i} + 1} = \prod_{i=0\dots n-1}{F_i} \]
For example,
\(k\)\(2^k - 1\)Expression
23\(3 \)
415\(3 \times 5 \)
8255\(3 \times 5 \times 17 \)
1665535\(3 \times 5 \times 17 \times 257\)
324294967295\(3 \times 5 \times 17 \times 257 \times 65537\)
and so on...

So how are Fermat numbers related to bit mask construction? It turns out that if you write 3, 5, 17, 257, 65537 in binary, you get 11, 101, 1 0001, 1 0000 0001, and 1 0000 0000 0000 0000 0000 0000 0000 0001 respectively. If you multiply 101 (5) and 1 0001 (17), you get alternating patterns of 0101 0101 (85). If you multiply this by 11 (3), you get an 8-bit all-one integer 1111 1111 (255). Conversely, you can take a pattern "out of circulation" by dividing the all-one integer by a Fermat number which is one of its factors.

Appendix: Counting the number of one bits using C++ template meta-programming

We start with the straightforward bit vector algorithm:
x = ((x >> 1) & 0x55555555) + (x & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) & 0x0f0f0f0f) + (x & 0x0f0f0f0f);
// etc.
which can be represented by the following for-loop:
for (int n = 1; n < sizeof(Tp) * 8; n *= 2)
  Tp mask = ~0 / ((1 << n) + 1));
  x = ((x >> n) & mask) + (x & mask);
In the template meta-programming version, the template __unroll_shift_mask_add essentially unrolls into the for-loop above, but in a manner that is agnostic to the actual integer type.
template<unsigned int n, typename Tp>
struct __unroll_shift_mask_add {
  static Tp compute(Tp x) {
    Tp y = __unroll_shift_mask_add<n / 2, Tp>::compute(x);
    Tp mask = ~Tp(0) / ((Tp(1) << n) + 1);
    return ((y >> n) & mask) + (y & mask);

template<typename Tp>
struct __unroll_shift_mask_add<0u, Tp> {
  static Tp compute(Tp x) { return x; }

template<typename Tp>
Tp one_bits(Tp x) {
  return __unroll_shift_mask_add<sizeof(Tp) * 8 / 2, Tp>::compute(x);

Saturday, November 26, 2016

My Setup with OpenZFS on Mac OS X

My new OpenZFS on Mac OS X setup in my cabinet, with raidz over 4x 1TB SSD on two AC powered USB 3.0 hubs. I ran into these gotchas:
  • The USB 3.0 UASP adapter I'm using sometimes changes the disk serial number, which causes zpool difficulty finding the disk after unplug or reboot. Fix: do a zpool export followed by zpool import immediately after pool creation renames the disks in vdev to use media ID, which is more robust. The media IDs are created the same time the disks are added to the pool and automatically partitioned and formatted (with the GPT scheme), so you can't create a pool with disks referenced by media IDs.
  • If a disk is lost due to serial number change, the only way to get it back is by creating the missing symlink under /var/run/disk/by-serial manually and export then import the pool again. ZFS does not allow replacing a disk that is still part of a pool.
  • When running zpool import SomePool with umask 0077, the parent ".." directory of /Volume/SomePool becomes accessible only to root and permission denied for all other users, which causes the Finder to crash when creating folders. The Finder also displays existing directories (created in Terminal) as a blank document with no content. Just export and rerun zpool import after setting umask 0022.
  • ZFS by default is case sensitive. Some application bundles are built incorrectly, which causes a variety of errors when I open the application such as "You can't open the application _______ because it may be damaged or incomplete" or "The application cannot be opened because its executable is missing." I ended up creating a dedicated Applications filesystem under the pool with casesensitivity=insensitive option (which can only be set when creating a filesystem).
  • Remember to set zfs set com.apple.mimic_hfs=on if you're planning to store Photos library on it.

Update on November 27:

I'm still experimenting with this setup, and in the process managed to toast an older 512GB SSD because one of my older USB 3.0 to SATA adapters had a bad contact, causing rapid power losses that fried the NAND. Trying to read/write problematic blocks would cause the drive to stall. But I managed to wipe it (under Mac OS X) by using VirtualBox running Ubuntu. Make sure the VM setting supplies USB 3.0 devices, or you'll get VERR_PDB_NO_USB_PORTS when attaching the USB/SATA adapter to the VM. Assuming the device shows up in the Linux guest as /dev/sdb (please double check, as this will be destructive), run:
apt install sg3-utils
num_blocks=...  # get this from /proc/partitions
sg_unmap --lba=0 --num=${num_blocks} /dev/sdb
After this, the SSD is empty and in a clean slate, and I could then format it.

Initially it seemed to be a good idea to use USB 3.0 hub and a bunch of USB 3.0 to SATA adapters to accomplish the disk array because it's much cheaper than an enclosure, but I've had many problems with it. The USB 3.0 hub sometimes incorrectly recognizes some adapters as USB 2.0 only, and this limits the speed of the whole array to USB 2.0 speed. I've also experienced flaky ports causing checksum errors. There are too many components that could go wrong and did go wrong, so I'm now getting a proper RAID enclosure as I should have in the first place.

Update on November 28:

The drive died again after writing some odd GBs of data on it. It seems that TRIM limits the number of blocks it can operate on at once, so I didn't actually wipe the whole disk. I also tried ATA secure erase but the drive died again, so it didn't seem to free up the blocks. I concocted a script:
for ((i = 0; i < ${num_blocks}; i += 8)); do
  while [[ ! -b /dev/sdb ]]; do echo; read -p 'waiting for disk...'; done
  printf '%d' $i
  sg_unmap --lba=$i --num=8 /dev/sdb
  printf '\r'
Even though it stalls from time to time requiring manual intervention, it does seem that the more blocks I unmapped, the more I could write, so I'm hopeful.

Update on November 30:

It turned out the aforementioned SSD drive didn't die. It's due to another flaky USB/SATA adapter. I replaced it with another, and the drive is working merrily. In case anyone's wondering, those IOcrest adapters seem to have a high failure rate. In any case, my USB woes are solved altogether with the Akitio Thunder2 Mini.

Sunday, July 10, 2016

C preprocessor hygienic macros

A hygienic macro is a macro that defines variables for its own use without accidentally confusing them with the variables defined by the user of the macro. By design, C/C++ preprocessor macro performs only simple string substitution which is not hygienic. That means those using a macro has to be mindful of the macro's definition and avoid using variables of already used names. This also means that such macro cannot be arbitrarily nested in scope, since nesting will cause variable conflict.

But hygienic macro is really useful for defining language extensions in the form of syntactic sugars. My motivation example is a foreach loop that was not added to the C++ language until C++11. Although C++ now has a ranged-for syntax, it is still useful for a plain C library that implements containers. There may be other use cases for macro hygiene, so I want to explain the technique I have been using for a number of years.

The "ranged-for" syntax in C++11 iterates over all items in a container sequentially:
std::vector<int> xs;
for (int& x : xs) {
  // User code.
I defined a macro to accomplish something similar:
std::vector<int> xs;
foreach(int& x, xs) {
  // User code.
The idea is that this macro is a synthetic sugar that expands to a for-loop using an iterator:
std::vector<int> xs;
for (std::vector<int>::iterator it = xs.begin(); it != xs.end(); ++it) {
  int& x = *it;
  // User code.
But this synthetic sugar does not quite work: we need to define the variable int& x on the user's behalf. The user shouldn't be expected to add int& x = *it; to his block statement. For that, I used another for loop to introduce the variable but only run it once.
std::vector<int> xs;
for (std::vector<int>::iterator it = xs.begin(); it != xs.end(); ++it)
  for (int& x = *it; true; break) {
    // User code.
What if the user code also contains a break statement? It will only break the inner for loop that introduces the variable binding but the outer for loop will continue. Unfortunately C/C++ has no labeled break. One solution is to use a variable to keep track of how the break happened.
std::vector<int> xs;
for (bool next = true; next; (next = false))
  for (std::vector<int>::iterator it = xs.begin(); next && it != xs.end(); ++it)
    for (int& x = *it; !(next = false); ({ next = true; break; })) {
      // User code.
We wrap the user code around a sentinel variable initially set to false before executing the user code, and reset it to true after. If the user code breaks, the sentinel variable remains false, so the next iteration will not run. This works, but now the code introduces two implicit variables, it and next.

The simplest way to construct a hygienic macro is by giving it explicit variable names, so we can define the foreach macro like this:
#define FOR_EACH_NAMED(decl_var, container, it, next)                   \
  for (bool next = true; next; (next = false))                          \
    for (autovar(it, container.begin()); next && it != container.end(); ++it) \
      for (decl_var = *it; !(next = false); ({ next = true; break; }))
Instead of writing the explicit iterator type std::vector<int>::iterator, we use another macro autovar() here to declare a variable with a type that is inferred from an expression, using a GCC extension __typeof__(). C++11 now has decltype which should be used for C++ code.
#define autovar(var, exp) __typeof__(exp) var = (exp)
Now, the hygienic macro simply generates unique variable names and invokes the explicit macro.
#define foreach FOR_EACH

#define FOR_EACH(decl_var, container) \
  FOR_EACH_NAMED(decl_var, container, VAR(__it_), VAR(__next_))
Now the interesting part is the VAR() macro that creates a unique variable name using a given prefix. Using a human readable prefix makes the generated code somewhat easier to debug. We define a UNIQUE() macro to generate a unique token, and this needs to be concatenated with the prefix only after evaluating the unique token, not before. For this to work, we use a JOIN() macro that does double concatenation.
// Creates a hygienic identifier with a given prefix name.
#define VAR(name) JOIN(name, UNIQUE())

// Concatenates expanded macro.
#define JOIN(a, b) JOIN_2(a, b)
#define JOIN_2(a, b) a ## b
As to the implementation of UNIQUE(), we use __COUNTER__, which is a GCC extension, if present. Otherwise the current line number __LINE__ would suffice.
#ifdef __COUNTER__
#  define UNIQUE() __COUNTER__
#  define UNIQUE() __LINE__
In summary, a good way to construct hygienic C preprocessor macro is by first defining an explicit version, then an implicit hygienic version using a unique variable generator.
#define FOO_NAMED(arg0, arg1, /* ... */, var0, var1, /* ... */) /* ... */
#define FOO(arg0, arg1, /* ... */) \
  FOO_NAMED(arg0, arg1, /* ... */, VAR(__var0_), VAR(__var1_), /* ... */)

Friday, May 13, 2016

Proof of Sum of Exponential Powers

My renewed interest in proof of exponent laws came from reading the Harvard Entrance Exam, 1869, which David Robert Palmer painstakingly typed up. In the exam there were sections of Latin and Greek, History and Geography, are various mathematics. It's fascinating that the words in Latin and Greek are already provided, so knowing how to conjugate is all that is required. The maths section are heavily calculation based. It seems that these sections serve to test only how nimble the minds are in solving known problems, but not so much whether the students exhibit the thought process of solving unknown problems. But one question intrigued me.

Question (3) in the algebra section reads, "What is the reason that when different powers of the same quantity are multiplied together their exponents are added?" It is asking about the exponent law for product of powers, \( a^x a^y = a^{x+y} \). Knowing the prestige of Harvard, giving an informal answer is likely not satisfactory. But most of the so called "proofs" that are prevalent in textbooks essentially boil down to the following reasoning.
\[ \underbrace{a a \dots a }_{x \text{ times}} \cdot \underbrace{a a \dots a}_{y \text{ times}} = \underbrace{a a a a \dots a a }_{x + y \text{ times}} \]
This appeals to the intuitive understanding that \( a^x \) is \(a\) multiplied \(x\) times, so when you multiply the result also by \(a\) for another \(y\) times, you get \( a^{x+y} \). This is hardly satisfying as a proof.

Another "proof" is to reduce this to the additive law of logarithms \( \log{xy} = \log{x} + \log{y} \), but this is like throwing the hot potato to someone else.
\[ \log{a^x a^y} = \log{a^x} + \log{a^y} = x + y \\
a^{\log{a^x a^y}} = a^{x+y} \]
And since \( a^{\log{c}} = c \) i.e. log is the inverse of exponent, so \( a^{\log{a^x a^y}} = a^x a^y = a^{x+y} \).

There are other exponent laws without formal proofs, but for this article I'm going to focus on the sum of powers.

Formal proof

We appeal to proof by induction for the formal proof. First we establish the base case for exponentiation:
\[ a^0 = 1 \]
Also recall that \(1\) is the multiplicative identity, so \( x\cdot 1 = 1\cdot x = x \), and \(0\) is the additive identity, so \( x+0 = 0+x = x \). From this, we can prove the base case where \( y = 0 \), i.e. \( a^x a^0 = a^x \cdot 1 = a^x = a^{x+0} \).

Now for the inductive case, we need both the axiom of exponentiation \( a^x \cdot a = a^{x+1} \) and the induction hypothesis \( a^x a^y = a^{x+y} \). We proceed to show that \( a^x a^{y+1} = a^{x+y+1} \).
\[ a^x (a^{y+1}) \underbrace{=}_{\text{axi.}} a^x (a^y a^1) \underbrace{=}_{\text{assoc.}} (a^x a^y) a^1 \underbrace{=}_{\text{I.H.}} a^{x+y} a^1 \underbrace{=}_{\text{axi.}} a^{(x+y)+1} \]
In the equations above, "axi." means the axiom of exponentiation, "I.H." means application of induction hypothesis, and "assoc." means associativity (which is true of all multiplicative groups, e.g. matrix multiplication).

Having painstakingly established a formal proof of the product of powers, there is a notable limitation, namely that we only allow natural number exponents, \( \{x, y\} \subseteq \mathbb{N} \). In particular, we don't allow negative (e.g. -3) or fractions (e.g. 3.14159...). Indeed, the real number domain has declarative properties that evade the mere application of modus ponens in discrete logic, which gives rise to the famous satire Alice in Wonderland.

Proof sketch in real number domain

Fortunately, all is not lost in the real number domain. We can invoke Taylor's theorem to expand the infinite series for calculating \( e^x \) and \( e^y \), multiply the two series, and see if \( e^{x+y} \) expands to the result of the multiplication.

The Taylor series for \( e^x \) is:
e^x & = & \sum^{\infty}_{n=0} {x^n \over n!} \\
 & = & 1 + {x^1 \over 1!} + {x^2 \over 2!} + {x^3 \over 3!} + {x^4 \over 4!} + \dots \\
 & = & 1 + x + {x^2 \over 2} + {x^3 \over 6} + {x^4 \over 24} + \dots
The following table multiplies the Taylor series for \( e^x \) and \( e^y \).
\times & 1 & x & x^2 \over 2 & x^3 \over 6 & x^4 \over 24 & \dots \\ \hline
1 & 1 & x & x^2 \over 2 & x^3 \over 6 & x^4 \over 24 & \dots \\
y & y & xy & x^2 y \over 2 & x^3 y \over 6 & x^4 y \over 24 & \dots \\
y^2 \over 2 & y^2 \over 2 & xy^2 \over 2 & x^2 y^2 \over 4 & x^3 y^2 \over 12 & x^4 y^2 \over 48 & \dots \\
y^3 \over 6 & y^3 \over 6 & xy^3 \over 6 & x^2 y^3 \over 12 & x^3 y^3 \over 36 & x^4 y^3 \over 144 & \dots \\
y^4 \over 24 & y^4 \over 24 & xy^4 \over 24 & x^2 y^4 \over 48 & x^3 y^4 \over 144 & x^4 y^4 \over 576 & \dots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
The Taylor series for \( e^{x + y} \), expanded individually, is:
1 &+& (x+y) &+& {(x+y)^2 \over 2} &+& {(x+y)^3 \over 6} &+& {(x+y)^4 \over 24} &+& \dots \\
 & & || & & || & & || & & || \\
 & & x & & x^2 \over 2 & & x^3 \over 6 & & x^4 \over 24 \\
 & & y & & xy & & x^2 y \over 2 & & x^3 y \over 6 \\
 & & & & y^2 \over 2 & & xy^2 \over 2 & & x^2 y^2 \over 4 \\
 & & & & & & y^3 \over 6 & & xy^3 \over 6 \\
 & & & & & & & & y^4 \over 24
As you can see, the expansion of each term corresponds diagonally to the multiplication table above, so we can indeed show by Taylor expansion that \( e^x e^y = e^{x+y} \). This is only a proof sketch, but the enterprising mind could organize the table into a formal proof, using Binomial Theorem and Cauchy Product.

You might have noticed that we have only established the product law at base \(e\) but not other bases. For the other bases, recall that \( a^x = e^{x \ln a} \), so we have:
\[ a^x a^y = e^{x \ln a} e^{y \ln a} = e^{(x \ln a) + (y \ln a)} = e^{(x+y)\ln a} = a^{x+y} \]


There are two reasons why I suspect the Harvard entrance exam only expected the informal reasoning of counting the number of multiplicands. Cauchy (1789-1857) product is a fairly recent development to the time of the exam in 1869, which makes it unlikely to become part of the curriculum. There were no other questions that require the use of proof by induction so that it is also unlikely to be part of the curriculum either.

On the other hand, it is only after I became older (and overeducated) that I realize a lot of maths I was taught in grade school were really just informal proof sketches, and even so I didn't learn to question the profoundness of their limitations. The informal proof sketch of counting the multiplicands is limited to exponent in the natural number domain and would not work for real number exponents. I spent the better parts of my youth learning faux math!

Laws in the real number domain cannot be proven by counting alone because real numbers are uncountable. For this case I appealed to Taylor's Theorem, first expanding the polynomial series and showing the combined series are equal. I suspect many of the real number laws could use the same treatment.

Thursday, March 17, 2016

Protobuf like wire encoding inspired by MsgPack

I had previously mused about binary JSON encoding and contemplated unifying it with flat memory structures inspired by Cap'n Proto. It turns out that the most interesting uses of proto structures are the optional and repeated fields as well as other variable length data, so much that proto3 has eliminated required fields but added support for maps. Flat memory layout does not present variable length data well, and even so, some encoding and decoding are still necessary.

Hence I've decided to take another look, this time applying some ideas of MsgPack to encode protobuf wire data. The traditional protobuf encoding allows 8 distinct type tags (4 used, 2 deprecated, 2 more are reserved). The used ones are varint (0), 64-bit scalar (1), length-delimited (2), and 32-bit scalar (5).

Each field is packed as (field_num << 3) | type_tag as a varint key followed by whatever the value is indicated by the type tag. The largest field number that can be represented in a single byte varint is 15 (largest single digit base128 is 127, divided by 8 for 3 bits used by the type tag). In practice, moderate sized proto messages would need 2 bytes for the field keys.

Again, drawing from the ideas behind MsgPack, a single-byte can be appropriated for a range of small integers and moderately sized type tags that can encode short lengths. Decoupling the field number and type tag makes it convenient to express maps very efficiently, but many messages can still be encoded in the same number of bytes or fewer.

To rehash, here are the ranges for small integers.

  • 0x00 .. 0x7F: zero and positive integers 0 through 127.
  • 0xC0 .. 0xFF: negative integers -64 through -1.
We allocate the range 0x80 .. 0xBF as follows:
  • 0x80 .. 0xAF: variable bytes with lengths 0 through 47.
    • followed by this many number of bytes for the actual data.
  • 0xB0 .. 0xB4: single scalar of bit sizes 8, 16, 32, 64, and 128 (little endian).
    • these can be interpreted as signed or unsigned integers or floating point numbers.
  • 0xB5 .. 0xB7: reserved.
  • 0xB8: nil or null
  • 0xB9: varint
  • 0xBA: variable bytes
    • followed by a varint for the length.
    • followed by this many number of bytes for the actual data.
  • 0xBB .. 0xBF: reserved.
There are still plenty of reserved type tags for future use, though I've left out some important ones such as start and end of a list or a map, but opted to wrap them inside variable length data for lazy decoding. This means unlike MsgPack where the schema can be discerned from the encoding, the encoding here requires a separate schema in order to be fully interpreted.

A top level message is encoded like a map with the sequence \( k_1, v_1, k_2, v_2, ... \) but all the keys must represent integers. A map, on the other hand, can have non-integer key types. A list is simply a sequence of values \( v_1, v_2, ... \). Nested messages, maps, and lists are encapsulated inside variable length data like strings.

All variable length data store the bytes contiguously so the memory region can be referenced directly, for zero-copy decoding. I've decided to forego chunked encoding for now, although chunks can also be referenced in memory as a "cord" as opposed to a string.

Note that it is also possible to implement zero-copy and lazy decoding for the original protobuf wire encoding as well, but I'm not aware of any implementation that does that.

Here is an alternative encoding that restores the ability to discern schema from encoding but preserve the ability to do lazy decoding.
  • 0x80 .. 0x8E, 0x8F: byte string with lengths 0 through 14, and arbitrary length.
  • 0x90 .. 0x9E, 0x9F: lists with encoded byte lengths 0 through 14, and arbitrary length.
  • 0xA0 .. 0xAE, 0xAF: maps or nested messages, same deal.
  • 0xB0 .. 0xBF: like the encoding before.
The idea is that we represent byte strings, lists, and maps / nested messages in separate ranges. The lengths 0 through 14 are encoded directly in the type tag, and the 15 means the length follows the tag as a varint.

The problem still remains that we can't do one-pass zero-copy encoding of variable length data and come back to fix the length, since we might not have reserved enough space for the length.

Saturday, March 12, 2016

How to Play Go Like a Computer Scientist

The board game of Go, or 圍棋 (weiqi), started receiving a lot of attention when AlphaGo beat the European champion Fan Hui in all of the five games. Now people are worried when AlphaGo is on its way to beat the world champion Lee Sedol, winning 3 out of 5 games so far. AlphaGo plays the game like no human as seen the game played before, inciting the awe of sadness and beauty. It is touted as an accomplishment of machine intelligence because traditionally the game of Go is very difficult for a machine to play well, whereas games like Chess is well-understood.

Board games are fascinating because a few number of rules allow an exponential number of possible game moves, some with winning outcomes, and some with losing. Traditionally, a naive game solver for any game tries all the possible positions of each alternate side's game play some number of steps ahead and chooses the first move that leads to the best gameplay after that number of steps, known as the min-max algorithm. Pruning heuristics is applied to reduce the search space so the computer does not waste time searching down "lost cause" paths, known as alpha-beta pruning. Even with pruning, the search space is still exponential to the number of positions on the board.

The keyword here is exponential. The machine can spend more time to compute, but for the amount of time spent \( t \), it can only search by \( \log{t} \) more steps. We can attempt to parallelize exponential search by enlisting the help of some number of machines, but it will still only allow us to advance the search depth by \( \log{n} \), i.e. the added intelligence is logarithmic to the number of machines.

Chess is well-understood. The number of game play combinations are vast, but we have found ways to prune the search space so that it is tractable with computer clusters that are practical to build. Computers can beat human playing Go on a smaller \( 9 \times 9 \) board with an upper bound of \( (9 \times 9)! \) or \( 10^{121} \) possible game plays, but on a professional championship level, the board size \( 19 \times 19 \) sports an exponential explosion of an upper bound of \( (19 \times 19)! \) or approximately \( 10^{768} \) possible game plays.

Even though Google might undertake the most ambitious of data center builds, they couldn't possibly shut down all the kitten videos on YouTube so AlphaGo could play against Lee Sedol with super-intelligence. Even so, they would achieve only logarithmic increase in intelligence. Building machines to brute force an exponential algorithm is a sure way to go bankrupt (NSA is an exception because the government prints the money).

I feel that a disclaimer is in order before I write any further. I don't have insider knowledge about how AlphaGo works. Everything I'm about to speculate comes from common sense and public sources.

AlphaGo algorithm reduces the search space using Monte-Carlo tree search, which means it randomly prunes some moves to reduce the branching factor, which allows the search to go deeper. The pruning is guided by deep neural network so it decreases the probability of pruning winning moves. Deep neural network works by building a hierarchy of signal processing nodes that amplifies certain signals after training.

When using DNN to recognize pictures, we can visualize how it works by looking at the images synthesized iteratively in order to boost certain signals.

In computer science, a pervasive strategy to design a super-efficient algorithm is by divide and conquer: basically, a big problem is solved by breaking down the problem into smaller sub-problems, until the problem becomes trivially solvable. Here are some examples:
  • Sorting is taught to every first year computer science undergraduate. Given \( n \) items to sort, a naive sorting algorithm like bubble sort will take \( O(n^2) \) steps, whereas better ones using divide and conquer like quick sort or merge sort will take \( O(n \lg{n}) \) steps. If we are to sort 1000 items, this makes the difference between 1 million steps verses just 10000 steps.
  • First year graduate student would likely learn about discrete Fourier transform. A naive DFT takes \( O(n^2) \) steps, whereas fast Fourier transform using divide and conquer takes \( O(n \lg{n}) \) steps.
  • A naive matrix multiplication takes \( O(n^3) \) steps for an \( n \times n \) matrix, but Strassen's algorithm using divide and conquer takes \( O(n^{\lg{7}}) \) or approximately \( O(n^{2.8}) \) steps.
  • The n-th Fibonacci number can be computed in \( O(\lg{n}) \) time using a divide and conquer method. The straightforward iterative method would take \( O(n) \) time, and a naive recursive method would take \( O(2^n) \) time. See Fibonacci Number Program for some examples in code.
For other algorithms, divide and conquer may not reduce the number of steps, but it still improves computation performance due to better cache locality. Divide and conquer is the basic idea of cache oblivious algorithms.

In other words, the real innovation of AlphaGo is not because machines achieved singularity, but because deep neural network approximated the divide and conquer method which happens to allow Go to be played efficiently.

The outcome of Go is scored by determining whether the pieces can survive locally, and the territorial occupation by the surviving pieces. This makes the winning strategy two-fold: on one hand, a player needs to ensure local survival of the pieces, but he has to take the territory into account. If a local battle is losing, it could be better for a player to strengthen his position on a different part of the board. This division of local and global thinking lends itself very well to using the divide and conquer method.

Here are some ideas how to divide. On the first level, we divide the board into subregions of \( 10 \times 10 \) each (figure left). We can also add overlapping subregions of \( 9 \times 9 \) or \( 9 \times 10 \) (figure right). This results in the first level division into \( 3 \times 3 \) subregions.

Left: Level 1
Right: Level 1 overlaps
Each subregion can be divided further into \( 5 \times 5 \) or \( 5 \times 4 \) grids with overlaps. In the following figure, only the division of the first subregion is shown, but it should be understood that the same division will take place for all first level subregions.

Left: Level 2
Right: Level 2 overlaps
A sketch of the divide and conquer algorithm would play Go by first choosing the region to play by valuation of the amount of territory it can likely occupy. Then within each region, it could brute force the positions that maximizes the survival of its own pieces.

Here is a very rough guesstimate of the number of steps needed to solve the game. These moves can be grouped into streaks of local battles taking place in \( 3 \times 3 \) subregions, and each streak is further divided to settle the \( 3 \times 3 \) sub-subregions which are about \( 5 \times 5 \) by grid. This reduces the amount of search space to be just:
\[ (5 \times 5)! \cdot (3 \times 3)! \cdot (3 \times 3)! \approx 10^{36} \]
This is a lot more computationally tractable. Of course, the caveat of my calculations is that I have never implemented a divide and conquer player for Go, and I don't know how well it works in practice. But even if we have to add a few levels of \( 3 \times 3 \) division, multiplying the search space by \( 9! \) still grows a lot slower than the exponential explosion of \( (9 \times 9)! \).
\[ (5 \times 5)! \cdot (9!)^9 \approx 10^{75} \]
Using this divide and conquer method, we can envision a game of Go played on a much bigger board, maybe like the 3-dimensional chess depicted in Star Trek.

The difference between the world champion Go player such as Lee Sedol and a modest Go player who happens to be a computer scientist, is that a computer scientist understands the power of divide and conquer. There is an immense beauty of divide and conquer that reaches far beyond game play. Nobody is worried when a computer doing quick sort can sort numbers faster than a human can with no sorting strategy, so why should we worry that a computer using divide and conquer can play a board game better than humans with only an ad-hoc strategy?

No, the computer is not about to achieve singularity. It is just a tool that helped us understand the game better.

Saturday, February 6, 2016

Generic setuid/setgid wrapper for scripts

The setuid or setgid bit on a binary executable allows the binary to run with the effective uid or gid set to the owner of the executable, rather than the user that runs it. It is a rudimentary way for anyone to impersonate as the executable owner, so the owner can provide limited services using the owner's identity, including using files only accessible by the owner.

Historically, system binaries such as /sbin/ping are setuid root since ping requires privileged network access, but capabilities have reduced those needs. Binaries that manipulate privileged files still need to be setuid root, such as /bin/passwd. To setuid non-root user is less common but is still useful to control filesystem access.

Sometimes it is useful to write a shell script and making it setuid or setgid for the sake of controlling filesystem access. Many experts generally warn against setuid shell scripts. Here are the common security issues specific to scripts.
  • Anyone could create a symlink that resembles shell options and obtain an interactive shell with elevated privileges.
  • Anyone could create a symlink to a genuine setuid script, then after the OS executes the interpreter, perform a symlink switcheroo to a malicious script.
  • Command line interpretation can be changed by PATH or IFS.
Some people advocate (warning: bad advice) creating a binary specifically for executing the shell, but it can render other safety guards ineffective. The dynamic loader is more careful with the LD_PRELOAD and LD_LIBRARY_PATH directives with a setuid executable. But in this case, only the wrapper binary is setuid, but the shell run under is not, so it will pull in the whole slew of dynamic loading hooks unless we sanitize the environment.

One technique that an OS overcomes the switcheroo problem is to open the script file and pass /dev/fd/N to the interpreter, which ensures the same file is seen by the OS and the interpreter. This must be explicitly enabled on the OS and is still ignored by Linux. At least on Mac OS X, the interpreter's own setuid bit is also ignored.

When it comes to my own need, it basically boils down to:
  1. Don't bother with setuid/setgid, but use sudo which sanitizes environments. My use case is most similar to the line "ALL ALL = (script-owner) NOPASSWD: /path/to/script" in /etc/sudoers, to be run as "sudo -u script-owner /path/to/script". But sudoers manual begins with a "Quick guide to EBNF" which scares me.
  2. Rewrite my shell script as a full blown C program. I'm not feeling overly excited about that.
  3. Write a setuid/setgid wrapper binary in C, and the rest as shell script.
Needless to say, I took the last approach.

ps. In doing this, I found out about the shortest open source license called "Fair License" which is appropriate for such a short program.

Friday, January 1, 2016

Address Resolution Protocol for Application Specific Buffers

As part of the c10m problem, which advocates separating data plane from the control plane that routes the data, applications (such as ScyllaDB) have been built to bypass the kernel for data plane, opting to receive and transmit network packets directly from memory mapped buffers of a network card using a framework such as netmap or DPDK. But applications built this way hinges on exclusive control to the network interface. Running additional applications requires additional interfaces.

Previously, the kernel's network stack also acted as a dispatcher so that multiple programs can share one networking interface concurrently and only receive traffic destined for the specific program. On receiving the packet, the network card puts the packet into the next available RX buffer, hands it over to the kernel, and the kernel copies the data out after parsing the packet headers to determine how to route the data. Bypassing the kernel not only bypassed the data plane but also the control plane.

One solution is to let the kernel retain control of all RX buffers and handle the receive events. To dispatch the packet, the kernel would modify the application's address space to memory map the RX buffer for that packet. The problem with this approach is that first access of the mapping causes a soft page fault worth thousands of cycles, and unmapping causes costly TLB shoot-down. Rather than wasting that many cycles on the mapping, we might as well copy the data which can be done through a DMA controller.

If we want to avoid copying and the cost of mapping, then we need to perennially map RX buffers to the address space and somehow design the network interface to dispatch receiving packets to the right buffers. An application is identified by the IP address (layer 3) and port number (layer 4). The kernel traditionally handles the layer 3 and layer 4 semantics, but modifying the network card (layer 2) to understand higher layer is a violation of the abstraction imposed by the OSI model.

To keep a network card only aware of layer 2, each application will be assigned its own MAC address. The packets will only be directed to the RX buffers assigned to the packet's destination MAC address.

Traditionally, a MAC address can have one or more unique IP addresses, but one IP address uniquely maps to one MAC address. This means that each application will require its own IP address. For IPv6 this might not be a big problem, but for IPv4 this is salt on a wound caused by exhaustion of IP addresses which had already happened.

Address Resolution Protocol is used by a network gateway to translate destination IP address to destination MAC address as a packet enters a layer-2 Ethernet LAN from a layer-3 WAN domain. If we were to reuse an IP address across multiple MAC addresses, we need to modify ARP to resolve IP-port instead of just IP alone.

Under this architecture, the application-specific MAC addresses are assigned randomly with collision detection over broadcast. All programs on the same host will still share the same layer-3 address by differentiated by port number. The network gateway effectively handles the predisposition of packets into network interface RX buffers. The kernel retains the canonically assigned hardware MAC address for backwards compatibility. The kernel will still have its own RX buffers: they handle low-bandwidth network control traffic (ARP, ICMP, DHCP, etc.), packets destined to non-existent programs, as well as overflows when an application could not release its own RX buffers fast enough to receive more packets.

In summary, here are the architectural changes I'm proposing:

  • Modify ARP to resolve IP-port to application specific MAC addresses.
  • Each application on a host will have their own assigned RX buffers, identified by the application specific MAC address which is randomly assigned to not collide on the LAN.
  • The RX buffers are perennially mapped into the application's address space.
  • The kernel still handles the control plane, responding to ARP requests that maps IP-port to application specific MAC address.

This will realize an efficient zero-copy C10M solution with minimal modification to existing network architecture.