Tuesday, December 1, 2015

Recursively Extensible Open Addressing Hash Table

There are two common ways to implement a hash table: open addressing and separate chaining. Both use a hash function to map keys to indices within an array for fast lookup. They differ in the conflict resolution strategy when multiple keys map to the same index. Upon conflict, open addressing searches for the next available index in the array through a probing policy such as linear probing (increment index by the number of probing attempts), quadratic probing (add index by the number of probing attempts squared), or by enumerating through different hash functions. Separate chaining avoids conflict by maintaining keys of the same index as a linked list.
Open Addressing
Separate Chaining
Both tables can be subject to O(n) lookup for a naive implementation when there are many collisions. A better one would dynamically resize the array depending on the load factor, but resizing is an expensive operation. For separate chaining, performance degradation can also be mitigated by using a balanced binary search tree instead of a linked list. Since the worst case access time is bound to O(lg n), it can retain some efficiency without resizing.

So far, this is text book material prior art.

In terms of cache efficiency, separate chaining has the disadvantage that list nodes may be scattered, so accessing the list can cause many cache misses. Open addressing is more cache friendly because the key and values are stored in a contiguous array. The question is: can we do away without resizing like separate chaining, but enjoy the cache efficiency of open addressing?

Yes we can. Here I introduce the recursively extensible open addressing hash table.
A recursively extensible open addressing hash table keeps the key/value in the table, but also reserves a pointer to an extension subtable per entry. To resolve conflict, it would first probe within the same table for an available slot but limited to k tries with a small k. If the search fails, then it would descend to the extension subtable at the original hashed index, creating the subtable only if necessary. It would use an orthogonal hash function to compute the index within this subtable, repeat the probing, and descend if necessary.

An orthogonal hash function is linearly independent (in the linear algebra sense) to the hash functions used in parent tables. An easy way to compute the orthogonal hash is to consider a universal hash function producing a 32-bit unsigned integer output. The first level hash table has 256 entries, and the index is based on the value in the 8 least significant bits. The second level tables also have 256 entries, with index drawn from the next 8 bits, and so on for the third and fourth levels.

As for the probing policy, we could use any policy like any open addressing hash table would, but linear probing has the best cache locality, and it is sufficient to achieve high load factor without compromising lookup performance. That is because the subtables all enjoy O(1) access as well.

Tuesday, November 10, 2015

From Branch Prediction to Predictable Branching

Branch prediction is a technique in multi-stage pipelined CPU design to avoid pipeline stall. The reason of the stall is that early stages of the pipeline such as instruction fetch and decode cannot happen unless the CPU knows where to fetch the next instruction, but it has to wait for the result from earlier instructions to know which branch to take. Branch prediction speculatively executes one of the branch. If the prediction turns out to be good, the results of the execution is committed, otherwise the pipeline is flushed, and execution restarts from the actual branch.

There are two constructs in the program that cause branching. Conditional branches come from if-statements, where the code path takes the then-clause if the condition is true, or the else-clause otherwise. Indirect branches come from function pointers (used by callbacks), virtual method tables (used by object-oriented languages), and switch-statements. Unconditional direct branch does not need to be predicted.

Branch prediction has been well-studied, such as how various CPU microarchitectures implement branch prediction, and the limits of branch prediction. Since prediction misses are expensive, they measure the effectiveness of branch prediction by the percentage of hits. But I argue that branch prediction might not be the best approach to solve the pipeline stall problem. The better idea is to make branches predictable.

Predictable Indirect Branches

Let's say the program is about to call a virtual method of an object. The first word of the object is a pointer to the virtual table. The method pointer is the 7th word in the virtual table. The object pointer is stored in register %r0, and the stack pointer is in register %sp. For brevity, pointer arithmetic is expressed in the unit of a machine word.
; zeroth argument to the method is the object pointer.
store %r0 -> %sp + 0

; compute the arguments and store them.
store %r1 -> %sp + 1
store %r2 -> %sp + 2

; prepare to call the method.
load %r0 -> %r3      ; load address of the virtual table
load %r3 + 7 -> %r4  ; load address of the method
call %r4
The reason of the pipeline stall is because the call instruction doesn't know what is in %r4 until the instruction to load method address into %r4 is executed.

I included the code to compute the function call arguments to emphasize the fact that there is a lot of work that the CPU could be doing before the real branch target is computed. We can avoid the pipeline stall if we just reorder the work slightly.
; prepare to call the method.
load %r0 -> %r3      ; load address of the virtual table
load %r3 + 7 -> %r4  ; load address of the method
setcall %r4  ; set the branch target but do not branch yet.

; zeroth argument to the method is the object pointer.
store %r0 -> %sp + 0

; compute the arguments and store them.
store %r1 -> %sp + 1
store %r2 -> %sp + 2

call  ; do the actual call.
In this latter example, we compute the branch target first and do the other work later. By the time the pipeline needs to decode the next instruction after the branch, setcall should have been executed and provided us with the actual target. If not, it is okay for the pipeline to stall for a bit. Programs that do not use setcall/call can still benefit from the branch prediction buffer.

Predictable Conditional Branches

Conditional branches are a bit trickier because we can't generally precompute the condition. One technique used by the programmer is to compute both branches and mask the result of either one. The mask can be created using a bit manipulation technique. For example, the statement:
int result = condition? expression_true : expression_false;
Can be decomposed as:
int result_true = expression_true;
int result_false = expression_false;
int mask = to_mask(condition);
int result = (mask & result_true) | (~mask & result_false);
This approach only works if the expressions are trivial to compute. Otherwise the redundant work would be more expensive than pipeline stall.

A more general approach recognizes the fact that conditional branches are often asymmetric: one branch is the fast path in a tight loop, and the other branch is the slow path. Take for example an implementation of the strlen(s) function.
size_t strlen(const char *s) {
  size_t n = 0;
  while (*s)
    ++n, ++s;
  return n;
Its assembly pseudo-code might be:
  load %sp -> %r1  ; load s into %r1
  xor %r0, %r0 -> %r0  ; zero %r0 as n

  loadbyte %r1 -> %r2  ; dereference *s
  bz %r2, endloop  ; branch if zero
  inc %r0
  inc %r1
  b loop  ; unconditional, no prediction needed.

  r  ; return value in %r0
The basic block between loop and endloop is the tight loop. One way to argue that it must run as quickly as possible is because it runs many times; that's the classic branch prediction argument that sought to maximize the probability of prediction hits.

But we can also argue that branches should always favor the tight loop because the consequence of taking the bz branch result in more expensive code (function return), whereas the cost of not taking the branch is cheap. This is true even if s is an empty string. In other words, we are making the branch predictable by analyzing whether the cost of prediction miss can be amortized into the code that follows.

The consequence of misprediction can be visualized as the following table:

Fast PathSlow Path
HitFor The Win!Not a big win.
MissHuge penalty!Small penalty.

There are two ways to distinguish fast and slow paths. The compiler can insert branch hints, which should work well because branch misprediction cost is static. The CPU can also decode ahead and compute the path cost of both branches. In the case where it is difficult to tell between the fast and slow paths, the CPU can always use the branch prediction heuristic based on historical data.


Here I presented methods to make branching predictable. For the indirect branch, the idea is to compute the branch target as early as we can before the actual branch, and there are usually more work that could be done in the mean time. For the conditional branch, the idea is to distinguish the fast and slow paths, and prefer to branch to the fast path because the slow path is penalized less by miss cost. Both methods can be used in addition to the branch prediction buffer based on historical data.

Thursday, August 6, 2015

Practicality of the Vector Packing Problem

Vector Packing problem is a generalization of the Bin Packing problem. In the simplest form, the Bin Packing problem seeks to find a placement of objects in a bin (both one-dimensional) such that the smallest number of bins are used. With vector packing, both the bin and the objects can have several dimensions, and different sizes along these dimensions.

Even in the simplest form, bin packing is an NP-hard problem. An approximation algorithm that runs in polynomial time may not produce the optimal solution, as you can see in the following illustration.

The first fit algorithm sorts objects by size, and then place each object in the leftmost bin that has space that fits. Heuristic algorithm can try harder to find a better solution at greater computation cost. You can imagine that vector packing problem is even more complex, now that the algorithm has to consider several dimensions of constraints.

Researchers understand that vector packing problem has a very useful application in cloud computing, and this has inspired a wealth of research literature. In this application, bins represent machines in a computation cluster, and each bin has a dimension for CPU, RAM, disk space, and network bandwidth. The objects represent jobs to be scheduled in the computing cluster. A job has specifications for the amount of CPU, RAM, disk, and network it uses. We can pack several jobs into a machine as long as the sum of the resources used is under what the machine could provide.

Better packing means that unused machines can be powered off to save energy, while the online machines have higher utilization which helps them achieve better energy efficiency. But as the number of jobs and the number of machines grow, even a heuristic algorithm can take a long time trying to find a better packing solution. This can grind cluster scheduling to a halt.

But I want to call this a fallacious understanding of the real problem we want to solve. In practice, cluster scheduling differs from the vector packing problem in the following regards:
  • Bin or vector packing is a static problem where, once the objects are placed, they are not meant to be moved or displaced with other objects. Scheduling is inherently a dynamic allocation problem where objects come and go (i.e. they have a lifetime). An optimal packing does not imply optimal scheduling.
  • The scheduling as a dynamic allocation problem is usually tackled as an offline problem where all the objects and their lifetimes are known in advance. In practice, we do not know the lifetime of jobs, which may stay running for a long time, but a human operator can always restart jobs or start new ones.
  • A cluster typically have machines of a variety of specifications that differ in CPU speed, number of cores, amount of RAM, etc. Bin or vector packing assumes that all bins are of the same size.
Furthermore, in real life, job specification may not accurately reflect its actual usage. A job that says it uses a certain amount of CPU or RAM might end up using more, or using less. Or the usage may be sporadic. If it leaves garbage files around, then its disk space usage would grow indefinitely.

To define the cluster scheduling problem better, we need to understand that some of the dimensions are soft constraints, while others are hard. In general, temporal constraints are soft, while spatial constraints are hard, but there are exceptions.
  • CPU is a soft constraint. If it is over-subscribed, jobs will just take longer to compute.
  • Network bandwidth is also a soft constraint. If over-subscribed, then jobs will just take longer to send or receive data.
  • RAM is a spatial constraint, except that can actually be over-subscribed because of virtual memory. When memory pressure is high, rarely used memory pages are swapped to disk to meet new memory demand. This swapping can cause performance issues, but it now allows a temporal trade-off, which makes RAM a soft constraint.
  • In some deployment scenario, disk seek time is a scarce resource, but it is still a soft constraint. If over-subscribed, then jobs will simply spend more time waiting for disk I/O.
  • The only remaining hard constraint is really the disk space.
    These observations simplify cluster scheduling to become more like a one-dimensional bin packing problem, where jobs are scheduled only based on the amount of disk space it expects to use. To meet the soft constraints, we simply monitor each machine to see if the CPU, RAM, network, or I/O is oversubscribed. If so, the machine monitor will try to move contending jobs to different machines. But over-subscription will not prevent us from tentatively scheduling new jobs to run on this machine since we don't know about their performance characteristics yet; they may not cause any more contention.

    It's not always a good idea trying to solve a practical problem with a theoretical one, since a theoretical solution is likely a lot harder, and will still miss the mark. Vector packing problem is not a practical representation of the cluster scheduling problem.

    Thursday, July 23, 2015

    Tweet sized Mac OS X 10.10 exploit

    It took me a while to understand what this exploit is doing, so I thought I would make a note for my own benefit and for others.
    echo 'echo "$(whoami) ALL=(ALL) NOPASSWD:ALL" >&3' | 
    DYLD_PRINT_TO_FILE=/etc/sudoers newgrp; sudo -s # via reddit: numinit (shorter)
    The entire line is a shell command that adds the current user to the /etc/sudoers file and grants it sudo access without password, then runs sudo to escalate to a shell with root privilege. The greyed out part is the comment. But how it accomplishes this is very clever.

    Here is what the rest of the code does: newgrp is a command that starts a command line shell after setting or resetting the effective group. It is a setuid root program which means it runs with escalated root privilege, but newgrp only runs the sub-shell after downgrading the privilege back to the current user and the desired group. The shell interpret commands from its standard input, so this fragment pipes a command to the sub-shell which runs it.
    echo 'echo hello' | newgrp
    The sub-shell started by newgrp runs the command echo hello which prints out hello to the terminal. In the exploit, the command passed to the sub-shell is echo "$(whoami) ALL=(ALL) NOPASSWD:ALL" >&3 which writes a sudoer line for the current user to file descriptor 3.

    How does file descriptor 3 come into play? When you run newgrp with the environment variable DYLD_, it gets interpreted by the dynamic loader (dyld) with a specific meaning. The purpose of the dynamic loader is to piece a program together from different parts. It runs as the same privilege of the program.

    On Mac OS X 10.10, the man page describes DYLD_PRINT_TO_FILE as follows.
    This is a path to a (writable) file. Normally, the dynamic linker writes all logging output (triggered by DYLD_PRINT_* settings) to file descriptor 2 (which is usually stderr). But this setting causes the dynamic linker to write logging output to the specified file.
    Before dyld could write to a file, it has to open the file and obtain a file descriptor. Here DYLD_PRINT_TO_FILE=/etc/sudoers tells dyld to open /etc/sudoers which is normally only writable by root. It succeeds because dyld runs as root since newgrp is setuid root.

    In the usual case, file descriptors 0, 1, 2 are reserved for standard input, standard output, and standard error, so the next available file descriptor is 3 which is assigned to the newly opened file. Now dyld is not told to actually write anything, but it leaves the file descriptor 3 open. Unless file descriptors has O_CLOEXEC (see open), they are kept open across fork() and exec() which are system calls for starting a new program. The same file descriptor, now with access to a file that normally root can open, is seen by the sub-shell run by newgrp even though the sub-shell is not root.

    The command run by the newgrp sub-shell then modifies /etc/sudoers so that the current user could run sudo without a password, by writing the line to file descriptor 3.

    There are several ways to block this exploit.
    • dyld should strip all DYLD_ environment variables if it realizes that the current program is setuid. This is the approach taken by the Linux dynamic loader. These environment variables alter the behavior of dyld, which is undesirable for setuid programs.
    • newgrp should close all file descriptors except 0, 1, and 2 before running the sub-shell. Most of the times, file descriptors are not intended to be passed around across processes.
    • newgrp could also close file descriptor 0 (standard input) and instead reopen the controlling terminal. This disallows passing command via standard input.
    • In general, setuid programs that start a sub-shell should not do so without requesting password from the terminal. Programs like login, su, and sudo are generally okay.
      • Command to find setuid root programs: find /bin /usr/bin /sbin /usr/sbin -user root -perm -4000
    • The open() call might instead randomize the file descriptors returned. Conventionally it always uses the lowest numbered file descriptor, but this has no tangible benefit.
    • DYLD_PRINT_TO_FILE could open the file using O_CLOEXEC. When the current program starts a child process, the current file descriptor would be closed, but the dyld of the child process will see this environment variable again and attempt to open it using the appropriate credentials of the child. As long as the file is opened in append mode, the parent and child can both write to the same file atomically without overwriting each other's content.
    Doing all will prevent any of these components from becoming part of a pathway for a future exploit.

    Saturday, May 23, 2015


    A coworker sent an e-mail with a list of questions that some grade school kids have for people in the field of computer science. The questions are intriguing enough to me that I want to spend some time personally thinking about them and answering them. The questions fall under these categories: (1) personal inspiration, (2) classes or learning, (3) about the computer science field in general, (4) personal projects, and (5) women in CS. The 5th category is because the questions came from a forum focused on girls in school.

    Personal inspirations

    What got you into coding? What inspired you? ...

    I have an uncle who worked in electric engineering. When I was in the third grade, he was the only one in our family with a computer (and a handheld camcorder). My mom brought me to his home after school, while he was still at work, and let me play with his computer, mostly games but once I drew Goemon Impact using MS Paint. I also destroyed many expensive IC chips that he kept in his drawer by my ill-conceived attempt to design my own imaginary circuit board: I didn't know about soldering, so I bent the pins to make the chips stay on the board. He wasn't married back then, so he spent many girlfriendless lonely nights unbending the pins.

    For some reason he never reprimanded me. I feel very sorry every time I think back. Note to self: do not let kids play with the stuff in my drawer.

    My mother was looking to learn to use the "personal computer," and she borrowed a book from my aunt about how to use MS-DOS. My mom never ended up reading the book, and for some reason the book fell into my hands. Back then, operating systems came with some programming tools, so one of the chapters in the book was about using the assembler MASM and how to diagnose programs using DEBUG.COM. These are programming tools at a very low level, very close to how the actual hardware works.

    One day in third grade, my mom brought me to a computer shop with a books section, after some after-school science class that she sent me to. As I was browsing the books, I found one that says MS-DOS and programming in the title. I had no idea what was in it, but as I flipped through the pages, I accidentally ripped a page. It was a minuscule tear, but I somehow felt obligated, so I convinced my mom to buy it. I determined that I would read it enough to understand it completely so I don't waste my parent's money. The book was about the MS-DOS system architecture, and was the second volume of a three volume series. The third volume was about MS-DOS system calls and contained programming examples on how to use them. I eventually got the third volume as well, but I never went back to the first volume.

    So as fate has it, my first programming language was assembly in 16-bit Intel 8086. It was as close to the hardware as it could be. I was writing instructions that perform arithmetics using machine registers and jump to memory addresses. When my program crashes, it also crashed the whole system, so I had to hit the reset button to reboot the computer. I did that many times, and the button became very polished. But if you think about it, low level systems were really simple because it had to be. Modern hardware is more complicated, but there is still a subset that is somewhat approachable.

    In fourth grade, my mom sent me to a class that taught programming in BASIC. It was a rather high level but unstructured language. I think it was helpful to learn computing from both the high and the low level. Then I went to the middle and learned C, and later C++. The kind of programming I did was all pretty naive. I knew about table lookup, but didn't know about binary search or data structures until college. Data structures are about how to organize data in a computer so they could be inserted, queried, and deleted quickly. Not knowing how to organize data effectively means your program will likely run slower than it needs to. This brings us to the next topic.

    Classes and learning

    Why do you think it's important to study computer science? What classes should I take?

    Before answering questions about classes and learning, we need to understand what computer science is about.

    Computers are stupid. They follow your instructions and will do exactly that. Computer science is about how to make computers smart and more effective, so you could do more with these otherwise incredibly stupid machines. What makes computer science a prolific field is that both hardware and software can be cheaply replicated, so a good computer scientist can do great things with computers, and more by adding more computers.

    For this to be possible, you need to learn about how to organize data using data structures, how to solve problems with algorithms, some theories about predicting the program's space and time requirements and other performance characteristics, how to program the computers to talk to each other, and finally, how to make sense of the programs you write.

    Furthermore, not all solutions are silver bullets. A solution may perform well under a specific condition, but it might suffer a bad worst-case scenario, or it may have other trade-offs. It is important to know them so you can apply the right solution to the specific problem at hand. A college class would most likely be using sorting algorithms to illustrate these concepts. Sorting rearranges numbers so they appear in increasing order, such that the number after is always greater than the number before. Of course you could sort in decreasing order as well, but you only need to make minor changes in your sorting algorithm to do that. Or if you're a clever programmer, you implement the sorting algorithm once that can be instantiated to sort in either order, and can work with data other than numbers.

    When it comes to the trade-offs of sorting algorithms, quick sort tends to be the fastest in practice, but could be very slow if the numbers are almost sorted or inversely sorted. Insertion sort is generally slow, but is fast if the numbers are almost sorted. Merge sort has no such "it depends" scenario but tends to be slower because it moves the data around more, and for a simpler implementation, uses more space.

    Such trade-offs are prevalent in the field, and some people dedicate their lives to study really subtle trade-offs, when such trade-offs can still cost millions of dollars on really big problems.

    Coding can be learned independently any time from books or websites, but I find that computer science concepts are better learned in a formal higher education setting, e.g. in college, and in some cases, grad school.

    About the computer science field in general

    What do you find most challenging? What do you think the computers will be like in the next 50 years?

    Remember, computers are incredibly stupid machines. The computers these days appear smart only because some dedicated computer scientists figured out how to make them so. I am of the opinion that computers can only be as smart as the person programming them; and that even if someone figures out how to make computers program themselves, the cumulative result of computers programming itself to program itself will converge to a finite limit. Of course, that means that computer scientists cannot be replaced by computers, so that guarantees some job security!

    The difficulty is trying to figure out what problems there are left to solve. Many problems such as sorting, data structures, networking, and even distributed computing have good enough solutions, which means the likelihood of breakthrough is small. There are some open problems pertaining to computer vision and artificial intelligence. Some problems are interesting in a theoretical sense but difficult to prove. An example is the P =? NP problem: whether a deterministic machine (computation happens in a single sequential universe) has less computing power than a non-deterministic machine (computation happens in unlimited number of parallel universes). It's still unsolved.

    Cryptography is another highly anticipated field. It's about information security, so that only the intended parties may exchange messages, and no other parties can figure out what the messages are even if they are intercepted. The assumption of cryptography is that it would be computationally expensive to decrypt a message without knowing some secret. But computers are getting faster, which makes it easier to guess the secret. The challenge there is to find different ways to encrypt a message that still makes it computationally infeasible to decrypt without the secret.

    My personal conviction is to improve the way we understand programming so that programs are easier to understand and less error-prone. But again, this is also a very mature field with mixed results. There is some potential for advancement, but it's not clear how it's going to be.

    Personal projects

    What kind of things do you program? What programming language do you use?

    I am currently trying to understand limitations of existing programming languages in order to find new language design or paradigm that would alleviate these limitations. The limitations aren't about the computation power, but it's about how to write programs and reuse them more straightforwardly. I'm writing an assortment of common routings in C++ that might help me identify these opportunities.

    Women in CS

    What does being a women in CS entail for you? What are some challenges you have encountered as a woman in CS?

    Unfortunately I cannot speak for women in CS because I am not a woman. But I love giving the example of Ada Lovelace as the first programmer in history, a woman. She was studying mathematics under Charles Babbage because her mother thought it would remedy a family history of mental illness from her father's side. Charles Babbage designed the first analytical engine, a computer made of gears and shafts. He never built them, but in recent years the Computer History Museum have built one according to his original design and demonstrated that it worked.

    Computer science is a gender neutral subject, but as in any field, minority groups may be subject to political discrimination for any number of things from acceptance of academic papers for publication to salary and promotion. Women are not the only discriminatory group. Being a Chinese, I think we are discriminated in some ways as well. Once I submitted a paper for review---it was only single blinded which means I do not know the reviewer, but the reviewer can see my name---and a reviewer made a non-sense comment about my grammar as a reason for rejecting my paper.

    I think there is also a difference on the learning process between male and female. My hypothesis is that males are more accustomed to unstructured learning (i.e. figuring things out yourself) while females benefit more from structured learning (i.e. in a classroom with a course plan and assignments). Since computer science is a discipline that arose in the last 50 years or so, which is nascent compared to literature, mathematics, or even natural sciences, we have still yet to develop effective structured education, which may cause more women to struggle through the study. This situation will likely improve especially when teachers start to import computer science into the high school curriculum. College professors are definitely the subject expert, but grade school teachers know more about education theory and effective teaching than college professors.

    Closing remark...

    If you found this blog post because you are a young student looking to enter the field of computer science, I hope this is helpful to you. If you are a computer science veteran, I welcome your feedback.

    Saturday, March 7, 2015

    Designing an Automatic Laundry System

    Why It’s Almost Impossible To Teach a Robot To Do Your Laundry?

    It's not impossible. We just need to simplify the tasks to make it tractable for machines. Here are the steps outlined in the pessimistic post, and how to simplify them.
    1. Find the pile of dirty laundry, distinguishing it from other clutter that might be in the room.
      • Instead of tasking the robot to tell apart garments from clutter in the room, the robot should receive the garments from a designated drop-off point.
    2. Pick up each item in the pile. (Uncertainty: it’s unclear how many objects the robot will have to pick up.)
      • The drop-off will contain bins or bags classified by garment type, so the robot will never have to pick up garments and put them in the basket.
    3. Put each item in a laundry basket.
      • As far as the robot is concerned, the garments are already sorted.
    4. Navigate to the washing machine. (Because of where the robot has to hold the laundry basket, it can obstruct some of the its sensors which means it receives less information and cannot adjust its movements as precisely.)
      • In some homes, it is possible to build laundry chute to solve the transportation problem.
    5. Depending on the type of machine, pull or lift the door to open it.
      • The machine could be designed with a standardized and easy to open door.
    6. Transfer clothes into the machine.
      • We can design the laundry bins to be the canister that fits into the washer and dryer as part of the driver, so we don't need to transport clothes between containers. In the case of laundry bags, the bags themselves will go into the washer and dryer, so they will be made of meshed design.
    7. Add detergent and/or fabric softener.
      • The detergent and fabric softener should already be in a container that feeds into the machine. Human operator only needs to check the liquid levels periodically or receive an automated alert when they run low.
    8. Close the washing machine door.
      • Again, design the door to be easily operable.
    9. Choose the appropriate wash cycle (Delicate, Permanent Press, Heavy Duty) and start the wash.
      • Each bin or bag at the drop-off will select the wash cycle. This eliminates the selection process. In the future, we might design a sorter based on an RFID tag sewed onto the clothing, and have the sorter put clothes in the bin or bag automatically.
    10. Remove the clothes from the washing machine and transfer to the dryer. (Uncertainty: the robot doesn’t know beforehand how many times it will need to reach in, grab the clothes, and remove them in order to get them all.)
      • Again the bin or bag will facilitate transport from the washer to the dryer in batch.
    11. Choose the type of drying cycle and start it.
      • The bin or bag type will determine the drying cycle.
    12. Remove clothing from the dryer. (Uncertainty: how many times will it have to grab the clothes to get them out? Is there a sock still clinging to the inside of the machine?)
      • At the end of the cycle, the bin or bag will be transported either to a person to fold the clothing, or in the future, to a folding machine.
    13. Fold items depending on the type of apparel.
    14. Puts garments away in a dresser or closet.
      • Instead of retrieving clean clothes from a closet, perhaps the machine will just put the folded clothes packed in a box. It saves space too.
    The idea is that let humans do what we do best, and let the robots do the rest. Most of the steps are already solvable using today's technology without sophisticated artificial intelligence. Some of the steps involving human can use a designated machine in the future, such as sorting and folding.

    Sunday, March 1, 2015

    Why Perfect Forward Secrecy won't work over PGP?

    Repeat the title here: Why Perfect Forward Secrecy won't work over PGP?

    tl;dr because it's a logistic nightmare.

    Perfect forward secrecy uses a one-time session key to secure a cryptographic communication channel using a long term key. The session key is negotiated in such a way that gaining access to the long term key will not recover the session key, so if the long term key is compromised, it cannot be used to decrypt a previously eavesdropped communication.

    Of course, the compromised long term key could be used by a malicious third-party to impersonate as the victim in negotiating the cryptographic communication channel in the future.

    An example of perfect forward secrecy is Diffie-Hellman key exchange. Such key exchanges are typically implemented over TLS where two computers automatically negotiate the cryptographic channel transparently to the users.

    PGP, on the other hand, involves the user heavily as part of the negotiation. First you need to decide whom you include in your Web of Trust, by verifying in person their public key fingerprint. Without this step, you could be encrypting an important message for the wrong party. Then when you encrypt, you need to pick a cipher algorithm that your recipient's software knows how to decrypt. Back then it was a confusing boatload of 3DES, IDEA, TwoFish, BlowFish which sound like something from Dr. Suess. Nowadays you could count on AES. PGP then creates a random symmetric key, encrypts it with the recipient's public key, and produces the cipher text for you to deliver. This process is already onerous in itself. Only the most determined individual would put up with this effort.

    For Perfect Forward Secrecy to work, both the sender and the recipient have to contribute to the creation of the one-time session key. You need to first complete the key negotiation, send the negotiation to the recipient and hear back, before you can encrypt the actual message. You can't have Perfect Forward Secrecy without participation from the recipient. This is a logistic nightmare.

    This is not to say that even more determined pair of individuals could pull this off.

    Sunday, February 15, 2015

    Extended Datagram Protocol

    I need a datagram protocol as a basis for transporting RPC over a network. I'd like to call it Extended Datagram Protocol or XDP.

    Here are the requirements:
    • Message size will be larger than MTU up to ~128KB. Most networks support Ethernet MTU of 1500 bytes, which means large packets need to be fragmented. IP fragmentation handles packet loss poorly. If a single fragment is lost, the whole packet will be dropped.
    • The fragments are sent over UDP. XDP will handle packet assembly with forward error correction. Each fragment will have at most 1280 bytes of payload. It is a nice even number and leaves handsome room for the headers.
    • The RPC layer will allow messages to arrive out of order, so the datagram protocol does not need to worry about packet ordering.
    • The datagram protocol does not need to implement message acknowledgement. It may be handled by the RPC layer, or even the application layer.
    • Encryption will be done at the RPC layer.
    • Session management will be done at the RPC layer.
    I don't have a concrete design yet, but I think it is worth mentioning the rationale behind these requirements. There are a number of proposals that extend UDP such as Reliable User Datagram Protocol and Stream Control Transmission Protocol that don't fit the bill. In-order delivery of packets actually becomes a head-of-line blocker, such is what QUIC tries to avoid.

    QUIC (Quick UDP Internet Connections) features packet level error correction, but it is trying to do a lot more such as flow control and encryption. I think flow control should be handled by a layer above (e.g. RPC layer) which can do flow control not only based on network condition but also server load. I don't mind that it handles encryption. However, crypto negotiation packets can easily be larger than the MTU, so I think it would be easier to implement encryption with proper large message support provided by a simpler datagram protocol. I'm not trying to argue that QUIC should be designed differently, only that I think my requirements are different and not well-addressed.

    An important design philosophy I have is that we need to share protocol implementation as much as possible, perhaps even across layers. TLS basically invented its own RPC protocol, and it uses X.509 which uses the ASN.1 message serialization scheme that differs from TLS. If I were to layer my own RPC on top of TLS there would be three protocol implementations.

    The reason Extended Datagram Protocol does not handle anything besides message fragmentation is that everything else could really be handled by the RPC layer that has a strongly-typed and unified message serialization/deserialization scheme. The only difficulty that message serialization has trouble dealing with is the framing. If we were to send a message over a stream based protocol like TCP, we would first send the message length, then the message body. The sender could do this in one socket write, but the receiver cannot do this in a single read. The receiver would have to first read the message length, then the message itself. The receiver could speculatively read extra bytes in one socket read, but it would need a way to pass the excess data to the next receiver, often complicating the design of the protocol stack.

    Saturday, February 14, 2015

    Let Organizations be Smart

    Why is my smart home so fucking dumb?

    There are generally two philosophies to design a complex system. One is to centralize all control to a brain that aggregates data from all sensors and decides what to do with its agents. Another is to distribute the intelligence to its agents.

    A smart home is an example of a complex system. Let's say one of the design goal is to have motion activated lights that turn on with moving objects in the room and turn off after some inactivity. A centralized design would send motion data triggers to the brain; the brain would figure out which light to turn on, and send the command to turn on the light. The brain would keep a timer and turn off the light as needed. A distributed design would build the motion sensor and the timer right into the light switch in the same room that is hard wired to the light bulb.

    One might argue that the distributed design is inflexible. What if you want the motion sensor to do something else other than turning on the light bulb in the same room? But the problem with centralized design is that it is a single point of failure. Imagine when you are upgrading the software in the brain, or if you somehow bricked the brain, then you can't control any light in any room in the home. How inconvenient would that be?

    If you take inspiration from a living organism, the responsibility of control is split depending on the function. A creature is operated by the brain for objective, strategy, and coordination (e.g. hungry, find food, eat), but functions like reflexes that need critical response time and heart beats that need critical reliability are regulated locally. Not all actions need involvement from the brain, only the actions that need to be coordinated.

    A home that needs to be smart could take this hybrid design as well. The light switch could have a builtin motion sensor and timer, but it could also be controlled by a brain to override its function.

    This similar hybrid design can also be applied to corporate decision making. You always hire people who can think independently and figure out on their own how to get things done, but you lead them with objective, strategy, and coordination.

    Friday, February 13, 2015

    Crypto Ecosystem

    The crypto stack is complicated. There are many APIs and implementations of the same API. There are different ways the components could be configured to interact with each other. Here is my attempt at visualizing the various relationships of the crypto ecosystem.

    Here is the original drawing.

    The depicted components:
    • Applications: Chrome, Firefox, GnuPG, Keychain Access, OpenSSH.
    • APIs and protocols: CCID, PC/SC, PKCS#11.
    • Smart card drivers: CryptoTokenKit, OpenSC, *.tokend.
    • Hardware: PKCS#15 card, OpenPGP card.

    Saturday, February 7, 2015

    Yubikey NEO-n

    Just got my Yubikey NEO-n today and here is how I got it to work.
    • Downloaded Yubikey Manager (optionally, also Yubikey Personalization Tools).
      • I changed the connection mode to OTP+U2F+CCID.
      • Yubikey Personalization Tools only work if OTP mode is enabled. If not:
        • Both ykpersonalize, ykinfo report "no yubikey present"
      • Yubikey Manager can only enumerate CCID apps if:
        • No other programs are using PC/SC (e.g. gpg-agent).
        • ifdhandler has to be loaded (which may not be the case if the workaround below is applied for gpg-agent getting stuck).
      • The key already has OpenPGP app installed among others.
    • Followed the instruction in My Perfect GnuPG/SSH agent setup.
      • For Mac OS X, I downloaded GPGTools which came with the gpg command line and the gpg-agent.
      • If the 'generate' command doesn't ask for "Please specify how long the key should be valid" it might be that gpg-agent got stuck. Run killall -KILL gpg-agent and try the gpg command again.
    • Mac OS X Yosemite ships with a really buggy pcsc-lite implementation. Here is a workaround so that gpg-agent would not get stuck again.
      • sudo launchctl unload -w /System/Library/LaunchDaemons/com.apple.ifdreader.plist
        • I added -w to make the unload persist across reboots.
      • After this, scdaemon will use its own CCID driver to talk to Yubikey. Both OTP and U2F will still work since neither are interfaced through PC/SC.
      • Yubikey Manager will suffer reduced functionality since it no longer has CCID access:
        • It can no longer enumerate Available apps.
        • If OTP and U2F are both disabled, it will report "No device found" even if CCID is enabled.
        • Use launchctl load -w to re-enable.
    Currently I'm using Yubikey NEO-n with OpenPGP key for SSH login. Although the OpenPGP key is protected by a PIN, I only need to enter it once, and the key remains unlocked until I remove the NEO-n from USB. The problem is that while the key is unlocked, any malicious program I run could then gain access to my gpg-agent and impersonate me, which is not very secure.

    With U2F, a touch would be required before authenticating with a server, which makes impersonation more difficult. A remote attacker would have to convince me to touch the Yubikey physically. There is a patch in progress making U2F work with SSH directly, but it hasn't been accepted upstream.

    Saturday, January 24, 2015

    Managing Odroid and "mlinuxguy" changes in git

    Here are my notes getting a Debian Linux kernel package built for Odroid C1. I prefer to manage the Linux kernel source under its natural habitat (git) so let's clone it. I'm using the googlesource.com mirror but feel free to substitute another one.
    git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable.git linux
    cd linux
    Now let's add hardkernel's fork as a remote repository. I'm also adding the repository owned by mlinuxguy who has been working on improving the ethernet driver. The benefit of adding these repositories to the same clone is that they can share objects, and I can easily git-diff the changes.
    git remote add hardkernel https://github.com/hardkernel/linux.git
    git remote add mlinuxguy https://github.com/mlinuxguy/odroid-c1-network-driver.git
    git fetch --all
    Unfortunately, mlinuxguy only included the files he changed under drivers/amlogic/ethernet but not with the full path, so we need to fix up his commits before we can merge them. We'll create a new local branch and then do a filter-branch to rewrite the commits.
    git checkout remotes/mlinuxguy/master -b mlinuxguy-full
    git filter-branch --prune-empty --tree-filter '
        mkdir -p drivers/amlogic/ethernet/phy &&
        git ls-tree --name-only $GIT_COMMIT |
        xargs -I FILE mv FILE drivers/amlogic/ethernet/FILE'
    Now let's create a branch that merges hardkernel's odroidc-3.10.y branch with the changes from mlinuxguy.
    git checkout remotes/hardkernel/odroidc-3.10.y -b odroidc-3.10.y-mlinuxguy
    git merge -s recursive -X theirs mlinuxguy-full
    Notice: the ethernet's MAC address changed some commits after 764e57, so you will need to reset /etc/udev/rules.d/70-persistent-net.rules; otherwise you might not have network access.
    And now we have a kernel tree ready to be built. If you have not done so, install the prerequisite build packages.
    sudo apt-get install build-essential kernel-package lzop u-boot-tools initramfs-tools
    The Debian make-kpkg program allows you to configure a custom kernel using an existing .config, which we generate using Odroid C1's default.
    make odroidc_defconfig
    The following command will build linux-image-* and linux-headers-* in the parent directory. It takes about 50 minutes on the Odroid C1 which is not bad (the kernel itself took 20 minutes, the rest are modules and packaging). It saves me the hassle of messing with cross-compilation toolchain. It also means that I get to use Debian's gcc, not some binary from an untrusted third-party.
    nice time make-kpkg -j5 --rootcmd fakeroot --initrd --append-to-version=-odroidc+mlinuxguy kernel_image kernel_headers
    Note that the dtbs (device tree blob) is not packaged, so if you want to update it, you could make dtbs from the Linux source (the path is shown in the make output) and copy it to /boot manually. You should also be able to just use the old one.

    You can then install the Debian packages using the dpkg -i command which installs the vmlinuz-* under /boot. The initramfs-tools hook will create the initrd.img-* file under /boot as well. Note however that vmlinuz and initrd.img has to be converted to uImage and uInitrd for U-Boot. Rather than making the uImage under the linux source tree, we'll generate it from /boot. I'm using the same parameters that Odroid C1 kernel uses as revealed by running mkimage -l on an existing uImage and uInitrd.
    cd /boot
    mkimage -A arm -O linux -T kernel -C none -a 00208000 -e 00208000 \
        -n "Linux-3.10.44" -d /boot/vmlinuz-3.10.44-odroidc+mlinuxguy \
    mkimage -A arm -O linux -T ramdisk -C none \
        -n "uInitrd 3.10.44" -d /boot/initrd.img-3.10.44-odroidc+mlinuxguy \
    (TODO: I could not get any compression working under U-Boot possibly due to version mismatch, but uncompressed works.) (TODO: you could automate this with a kernel install hook, like how update-initramfs works.)

    If something fails to boot, you could boot into the original kernel in U-Boot like this:
    fatload mmc 0:1 0x21000000 uImage
    fatload mmc 0:1 0x21800000 meson8b_odroidc.dtb
    fatload mmc 0:1 0x22000000 uInitrd
    setenv bootargs "console=ttyS0,115200n8 root=/dev/mmcblk0p2 rootwait ro no_console_suspend vdaccfg=0xa000 logo=osd1,loaded,0x7900000,720p,full dmfc=3 cvbsmode=576cvbs hdmimode=720p m_bpp=32 vout=hdmi disableuhs"
    bootm 0x21000000 0x22000000 0x21800000
    After the build, to return the source tree back to prestine state (no littered build artifacts), run:
    git clean -f -x -d

    Tuesday, January 13, 2015

    Installing Debian on Odroid C1

    After my adventure of installing Debian on Pogo, I decided to ditch the Pogo and go for Odroid C1 which has gigabit ethernet and a quad core armv7 running at 1.5GHz. My objective again is to install Debian from scratch using the installer over serial console, without using a flash image that somebody else prepared.
    Warning: The ethernet driver for Odroid C1 is buggy and may have auto-negotiation problems which prevent the Gigabit ethernet from achieving line speed.
    Here is my shopping list:
    The USB UART Module Kit contains three pieces: the USB cable, a small board with FTDI chip, and a Molex cable. I mostly bought the kit just for the Molex cable because Odroid C1 serial console has a Molex socket. I kind of wished that Hardkernel could just sell the Molex cable separately so I could use the USB-TTL adapter I already had for Raspberry Pi.

    The Molex cable could be mated to a commodity USB-TTL adapter by inserting the short-end of the break-away headers to the Molex connector, and the long end to the TTL.

    Beware: I found out that you could (accidentally) power the Odroid C1 through the serial port if you connect the red wire, but the Ethernet chip would not receive enough power to work reliably. In my case, packets are transmitted but no packets are received. If you ping host X from the Odroid and run tcpdump on host X, you'd see the ICMP echo and reply, but Odroid C1 never sees the reply.

    You will have to disconnect the serial console's power even if you plug in the DC power, or the Odroid will continue to draw power from the serial port.

    First I need to flash the eMMC from Odroid's image to ensure the proper bootloader and the partition layout. It would be nice in the future if they could provide a tool to format the eMMC.
    • From: http://odroid.com/dokuwiki/doku.php?id=en:c1_release_linux_ubuntu
    • unxz --keep ubuntu-14.04.1lts-lubuntu-odroid-c1-20150102.img.xz
    • On Mac OS X:
      • Use "diskutil list" to find the eMMC device.
      • Use "diskutil umount" to unmount whichever volume is currently mounted, prior to flashing with the dd command.
      • Use "diskutil eject" to eject the device after the flashing is done.
    • dd if=ubuntu-14.04.1lts-lubuntu-odroid-c1-20150102.img of=/dev/rdisk2 bs=1M
    This will result in an eMMC with two partitions. Mac OS X or Windows will see only the boot (FAT32) partition. While you're at it, edit boot.ini and change bootargs from:
    setenv bootargs "console=ttyS0,115200n8 root=UUID=e139ce78-9841-40fe-8823-96a304a09859 rootwait ro no_console_suspend vdaccfg=0xa000 logo=osd1,loaded,0x7900000,720p,full dmfc=3 cvbsmode=576cvbs hdmimode=${m} m_bpp=${m_bpp} vout=${vout_mode} ${disableuhs}"
    setenv bootargs "console=ttyS0,115200n8 root=/dev/mmcblk0p2 rootwait ro no_console_suspend vdaccfg=0xa000 logo=osd1,loaded,0x7900000,720p,full dmfc=3 cvbsmode=576cvbs hdmimode=${m} m_bpp=${m_bpp} vout=${vout_mode} ${disableuhs}"
    That's because we will reformat the filesystem which results in a different UUID. It is also fixable later.

    We will now install Debian using Odroid's uImage with the uInitrd of the Debian network installer for efikamx, which you could download from here. I don't know if other installers under armhf work or not; maybe I just got lucky. You can either put the uInitrd under a different name (say uInitrd-deb) into the boot partition (the FAT32 partition that has boot.ini) or TFTP it in later. But we will first boot into Odroid's Xubuntu to resize the root filesystem since Debian installer can't repartition it.

    Don't plug in the Ethernet cable the first time, until you change the password for 'root' and 'odroid' over the serial console. Also, while you're at it, change /etc/ssh/sshd_config to disallow root login.
    # passwd -d root
    # passwd odroid
    Enter new UNIX password: 
    Retype new UNIX password: 
    passwd: password updated successfully
    # sed -i 's/\(PermitRootLogin\) .*/\1 no/' /etc/ssh/sshd_config
    # service ssh reload
    This ensures that nobody could login as root and replace our kernel before we're ready to install Debian. We will be keeping the same kernel for now. Run odroid-utility.sh now to resize the root filesystem to fill the eMMC.

    Reboot the Odroid, and press the enter key immediately to abort U-Boot. You will see the "odroidc#" prompt. Now it is safe to plug in the Ethernet cable.
    * Welcome to Hardkernel's ODROID-C... (Built at 19:33:00 Dec  8 2014) *
    PU : AMLogic S805
    MEM : 1024MB (DDR3@792MHz)
    BID : HKC13C0001
    S/N : HKC1CC0349B9915D
    Loading U-boot...success.
    U-boot-00000-g9f5aee4(odroidc@odroidc-v2011.03) (Jan 02 2015 - 17:56:06)
    DRAM:  1 GiB
    relocation Offset is: 2ff0c000
    MMC:   eMMC: 0, SDCARD: 1
    IR init is done!
    vpu clk_level = 3
    set vpu clk: 182150000Hz, readback: 182150000Hz(0x701)
    mode = 6  vic = 4
    set HDMI vic: 4
    mode is: 6
    viu chan = 1
    config HPLL
    config HPLL done
    reconfig packet setting done
    MMC read: dev # 0, block # 33984, count 12288 ... 12288 blocks read: OK
    There is no valid bmp file at the given address
    Vendor: Man 450100 Snr 01281340 Rev: 4.7 Prod: SDW32
                Type: Removable Hard Disk
                Capacity: 29820.0 MB = 29.1 GB (61071360 x 512)
    Partition     Start Sector     Num Sectors     Type
        1                 3072          263168       6
        2               266240         9363456      83
    Net:   Meson_Ethernet
    init suspend firmware done. (ret:0)
    Hit Enter key to stop autoboot -- :  1 tstc enter
    exit abortboot: 1
    Here we will load the original uImage and boot into uInitrd-deb of the Debian installer. You could just TFTP in the Debian installer uInitrd using the tftpload command (not shown).
    odroidc#fatload mmc 0:1 0x21000000 uImage
    reading uImage
    5434708 bytes read
    odroidc#fatload mmc 0:1 0x22000000 uInitrd-deb
    reading uInitrd-deb
    5473116 bytes read
    odroidc#fatload mmc 0:1 0x21800000 meson8b_odroidc.dtb
    reading meson8b_odroidc.dtb
    17748 bytes read
    odroidc#setenv bootargs "console=ttyS0,115200n8 rootwait ro no_console_suspend vdaccfg=0xa000 logo=osd1,loaded,0x7900000,720p,full dmfc=3 cvbsmode=576cvbs hdmimode=720p m_bpp=32 vout=hdmi disableuhs"
    odroidc#bootm 0x21000000 0x22000000 0x21800000
    The bootargs contain some magic incantation that tells the board how to activate the frame buffer. Without them you'd get a kernel panic. The bootargs and other load commands all come from /boot/boot.ini on the eMMC card. After you boot, you should be able to continue with the usual Debian Installer. Make sure that you use /dev/mmcblk0p2 as your root.

    If you forgot to fix boot.ini to use root=/dev/mmcblk0p2 instead of the UUID earlier, initramfs will drop to an "(initramfs)" prompt after failing to find root. The following commands fix it by assigning the new root filesystem the same UUID as before:
    mount /dev/mmcblk0p2 /root
    mount /dev /root/dev
    chroot /root
    tune2fs -U e139ce78-9841-40fe-8823-96a304a09859 /dev/mmcblk0p2
    The resulting system won't have any kernel modules, but the builtin kernel is functional enough without additional modules. You will be able to install the modules over the network after the system boots normally. See my next post on how to compile a kernel for Odroid C1.

    Wednesday, January 7, 2015

    Load Balancing Methods and Trade-Offs

    Load balancing is the act of distributing service requests from clients to multiple servers that can perform a service. There are many excellent articles about load balancing, particularly the mechanisms used for the web, but few understand the trade-offs. The mechanisms can be roughly classified as client-side or server-side load balancing.
    • In client-side load balancing, the client somehow obtains a list of possible server it may use, and it implements the smart to decide how to distribute its own requests among the list of servers it has. Round-robin DNS is such an example, where the client receives a list of servers over DNS query. Very little additional infrastructure is needed other than configuration changes.
    • In server-side load balancing, the client reaches a server-side load balancer that distributes requests to backend servers. For what the client cares, the load balancer is simply one gigantic server. Some examples are IP Virtual Server and HTTP Reverse Proxy. They require heavier infrastructure investment.
    In production, a mix of client and server side load balancing are often used. It's not hard to see why, if we consider the impact of load balancing at the client and at the server. Here we consider just latency and throughput.

    Client-side load balancing tends to suffer worse latency than server-side load balancing. The reason is because the client often doesn't have visibility on server availability or load. If a server is overloaded or is offline, then the client may have to wait for a timeout before it tries another server. Propagating server load information to the client does not improve latency because the propagation delay only adds to the overall service time. Server availability is also highly dynamic, so a client cannot reuse this information for an extended period of time. On the other hand, a server-side load balancer is much closer to the backend servers, which means it could obtain server availability much faster. Also, the balancer can amortize the cost of querying server availability over many requests. Server-side load balancing, therefore, incurs no latency penalty to the client.

    Client-side load balancing tends to enjoy greater throughput than server-side load balancing. That's because the network path for each client-server path could potentially take different routes. Of course, if a client tries to contact many servers, it would saturate its own uplink first, but if a server's uplink is saturated, we can easily add more servers and more uplinks. On the other hand, a server-side load balancer becomes a unique destination for many clients, so its uplink can be easily saturated. If it is the case that requests are smaller than the responses (which is often the case but not always), then Direct Routing can alleviate some of the load-balancer throughput bottleneck by having the backends return the response directly to the client, but it doesn't scale as easily as client-side load balancing.

    The solution often used in production is to deploy both client and server side load balancing. A cluster in server-side load balancing makes a destination highly available, which guarantees the best service time for a given client. But client-side load balancing can send a client to multiple such clusters for better throughput.