Wednesday, May 30, 2018

Passwordless Debian Login

I run Debian inside a VirtualBox VM and share my home folder on the host OS, and now I've decided that it's silly to have to enter my password inside the VM all the time. Mainly, I want to auto-login to the console tty, and also convince sudo to let me run commands without entering a password.

I thought this should be as straightforward as googling for a recipe. It turns out the instructions I found on the internet are just wrong and truly awful.

First, to configure auto-login on Debian running systemd the correct way, run systemctl edit getty@tty1 and enter the following:
[Service]
ExecStart=
ExecStart=-/sbin/agetty --autologin USERNAME %I $TERM
Replace USERNAME with your actual username. Here are the ways the page I found got it wrong: on Debian, there is no /usr/bin/agetty; it is found at /sbin/agetty. Also, the dangling I in the ExecStart line should have been %I, which is the systemd specifier that expands to the instance name (here, tty1). The baud rate argument should be omitted.

If something goes wrong, just switch to tty2 with Alt-F2 (or Fn-Option-F2 on a Mac) and fix the problem. If you have a touch bar, configure it to always show function keys. Otherwise, rebooting into rescue mode with the Debian installer CD always works.

Now, to convince sudo to let me run commands without a password, run visudo and swap the user and group specifications, with an end result like this:
# Allow members of group sudo to execute any command
%sudo ALL=(ALL:ALL) ALL

# User privilege specification
root ALL=(ALL:ALL) ALL
myuser ALL=(ALL:ALL) NOPASSWD: ALL
The space between NOPASSWD: and ALL is optional; it works either way. What is tricky is that statements below override the ones above. This is counter-intuitive for an access control list, which typically matches top-down. By default, the group specification appears below the user specification, so the group overrides the user, but that makes no sense. It makes more sense for the individual user setting to override the group one.

I don't have high hopes for a man page that begins with a quick guide to EBNF.

Thursday, April 19, 2018

My Experience with AWS CloudFront and Lambda@Edge

I've got a few classic Google Sites with custom domains, and now I want to enable TLS for them. I thought this would be easy; after all, both App Engine and Blogger now generate managed TLS certificates for custom domains. I thought classic Google Sites would be the same, but I was wrong. Not only is this unsupported, but some guy over at the production support forum turned indignant at people like me seeking help.

The basic idea is to use a CDN to terminate TLS. There are many CDN providers, such as AWS CloudFront, Google Cloud CDN, and CloudFlare, and in theory it shouldn't matter which one you use.

AWS CloudFront worked well for my AWS S3 bucket lifecs-static.likai.org, which hosts the static assets for this blog. An S3 bucket named after a custom domain can't be accessed over HTTPS directly, because the dots in the bucket name don't fit under Amazon's wildcard certificate, which covers only the immediate *.s3.amazonaws.com subdomain and not deeper subdomains. Fortunately, CloudFront is able to serve from an S3 bucket directly. For other origins, you have to specify the origin protocol. My hello.likai.org is hosted by nearlyfreespeech.net, which can be accessed over HTTPS as likaior.nfshost.com. They do provide a custom script to enable Let's Encrypt, but I still decided to go with CloudFront for TLS termination.

One of the Google Sites I want to enable TLS for is cs.likai.org, a classic Google Sites site. It can also be accessed over HTTPS as sites.google.com/a/likai.org/cs, but when accessed in this manner, the links generated in the page have the wrong base URI (the full sites.google.com form), which won't work when served over a custom domain. On the other hand, Google Sites refuses to serve HTTPS for custom domains, so the origin protocol has to be plain HTTP. Even so, there is another problem: the style sheets and images have HTTP links as well, so when the page is naively served over HTTPS, the mixed-content policy blocks those resources and breaks the rendering of the site.

As I was playing around, I noticed that AWS Lambda can be used with CloudFront to rewrite requests and responses, a feature known as Lambda@Edge. This can be used to rewrite the links in the page into the preferred scheme-relative form. For example, http://example.com/foo/bar would become //example.com/foo/bar, which means that when the base document is served over HTTPS, the link to example.com will also be HTTPS, but the link stays HTTP when the base document is served over HTTP. Of course, since you can rewrite the response body with Lambda, I could have used https://sites.google.com/a/likai.org/cs as my origin for end-to-end encryption, and trimmed the base URI from the response.

Before, cs.likai.org was a CNAME for ghs.google.com. Obviously, I can't tell CloudFront to use cs.likai.org as the origin and then point cs.likai.org back at CloudFront as a CNAME, as that would create an infinite request loop, so I created a new mapping cs-http.likai.org in the Google Apps admin console. The plan is to serve cs.likai.org with CloudFront, which fetches from cs-http.likai.org over unencrypted HTTP. It would have been easier if CloudFront let me use ghs.google.com as the origin and override the Host header, but it doesn't.

I wrote the following script for the Lambda (feel free to adapt it for your own use), to be associated as a CloudFront origin-response handler.
'use strict'

const cheerio = require('cheerio')
const url = require('url')

function stripLinkProtocol(attrname, elem) {
  const $ = cheerio
  // cheerio uses htmlparser2 and domhandler internally, so elem is a
  // domhandler DOM-like object.
  const link = $(elem).attr(attrname) // $.attr() is O(n), beware!
  const parsed = url.parse(link)
  if (!parsed || !parsed.protocol)
    return
  const newlink = link.substr(parsed.protocol.length)
  $(elem).attr(attrname, newlink) // $.attr() is O(n), beware!
}

function stripHtmlLinks(html) {
  const $ = cheerio.load(html)
  $('[href]').each((i, elem) => { stripLinkProtocol("href", elem) })
  $('[src]').each((i, elem) => { stripLinkProtocol("src", elem) })
  $('form').each((i, elem) => { stripLinkProtocol("action", elem) })
  return $.html()
}

exports.handler = (event, context, callback) => {
  const response = event.Records[0].cf.response;
  if (response.status == 200 &&
    response.headers['content-type'] &&
    response.headers['content-type'][0].value.startsWith('text/html') &&
    response.body) {
    response.body = stripHtmlLinks(response.body)
    response.headers['x-lambda-edge'] = [
      {'key': 'X-Lambda-Edge', 'value': 'html-strip-link-protocol'}
    ]
  }
  callback(null, response)
}
But deploying it wasn't as straightforward, so here are some of the surprises I encountered along the way:
  • Lambda@Edge only supports NodeJS 6.10 (as of April 19, 2018) even though AWS Lambda in general supports other runtimes.
  • The IAM role for Lambda@Edge needs an additional trust relationship (but not additional IAM permissions, which are only for associating the Lambda with CloudFront).
  • Any external dependencies (such as cheerio) must be prepackaged and uploaded as a naked zip file (the added files are not contained in a root folder). File permissions in the zip have to be world readable, or you will get an "EACCES: permission denied, open '/var/task/index.js'" error. The extra dependencies can be fetched easily with the "npm install ..." command. The main Lambda entry point must be called index.js.
  • You have to publish a version of the Lambda before you can associate it with CloudFront. After association, this version can't be deleted until all replicas have been disassociated.
  • To test the lambda without affecting what's live, create a new CloudFront distribution specific for testing, then attach the testing Lambda version to it. Do note that CloudFront updates are very slow (can take a few hours).
I also created a unit test event which makes it easy to try out different things.
{
  "Records": [
    {
      "cf": {
        "config": {
          "distributionId": "EXAMPLE"
        },
        "response": {
          "status": "200",
          "headers": {
            "content-type": [
              {
                "value": "text/html; charset=utf-8",
                "key": "Content-Type"
              }
            ]
          },
          "statusDescription": "OK",
          "body": "<link rel='stylesheet' type='text/css' href='http://www.gstatic.com'> <a href='http://example.com/foo/bar'>"
        }
      }
    }
  ]
}
For some reason, despite the test passing and CloudFront apparently having finished deploying with no errors, I still wasn't able to see any evidence that my Lambda ran. The links were left as they were, and there was no sign of the X-Lambda-Edge response header I inserted in my code.

And then I discovered this fine print, buried under mountains of documentation, in Updating HTTP Responses in Origin-Response Triggers.
When you’re working with the HTTP response, note that Lambda@Edge does not expose the HTML body that is returned by the origin server to the origin-response trigger. You can generate a static content body by setting it to the desired value, or remove the body inside the function by setting the value to be empty.

I feel that AWS Lambda@Edge is similar to App Engine, in the sense that if you wanted, you could write the whole server in Lambda, with access to backend storage like DynamoDB and S3, and AWS takes care of replication and scaling. But the Achilles' heel of AWS is the slow replication.

I should have created an App Engine app to front my classic Google Sites instead; then I would have gotten HTTPS for free.

Friday, January 19, 2018

Embedding a Binary Search Tree in an Array

Recently I started researching the most efficient data structure to represent an IP address lookup table. An IP address table is useful for making routing decisions or access control: it checks whether a single IP address is a member of any of a set of netblocks. A netblock is represented by a base address and a prefix length, e.g. 192.168.42.0/24 represents the range of IP addresses from 192.168.42.0 through 192.168.42.255. A hash table is not suitable for this purpose because hashes only support exact matches. With the given example, this would require entering all 256 addresses in the range into the hash table, a significant blowup.

Hardware routers use Ternary Content Addressable Memory, which can tell whether a specific value matches anything in the memory after applying a bitmask. A software implementation might use a prefix tree (trie), which in theory deduplicates common prefixes among the entries, resulting in a smaller table. However, any data structure employing pointers carries the overhead of the pointers themselves (a machine word each) and suffers cache locality problems, which are not considered in theoretical papers.

The most compact data structure for the vast majority of lookup tables is simply an array of IP netblocks. For an unsorted array, lookup takes \( O(n) \) time, which is slow for a big table. If we presort the array and use binary search, lookup takes \( O(\lg n) \) time. But binary search is a pathological case for cache locality, because lookups jump back and forth until they settle in on the target. Rather than binary search, which partitions the candidates into two parts, it has been shown that ternary search (three partitions) improves performance, while quaternary search (four partitions) improves little over binary search.

It occurred to me that cache locality for binary search could be improved if we simply rearranged the presorted array into an embedding of a binary search tree (BST), such that for the node at index \( i \), its left and right children are at \( 2i + 1 \) and \( 2i + 2 \) respectively. The scheme is inspired by how a binary heap is embedded in an array, but here we embed a complete BST. The rationale is that the search always proceeds left to right in the array, with the most frequently accessed items located at the beginning of the array, which gets the most cache hits.

Here is an example showing how a presorted array with the number sequence \( 0 \dots 9 \) could be embedded.

  i: 0 1 2 3 4 5 6 7 8 9
  x: 6 3 8 1 5 7 9 0 2 4
Note that this embedding works for any presorted array. The values of \(i\) are the indices in the BST embedding, and the values of \(x\) are the corresponding indices in the presorted array. A presorted array can always be represented by a complete BST, so the embedded array is dense. However, we can't embed a prefix tree this way, because a prefix tree is sparse.

The implementation strategy is to presort the array, create a new array, and copy the items from the presorted array to the new array in the order of the BST embedding. The preprocessing only needs to be done once, and it results in the most compact representation of the lookup table. The assumption is that once the table is preprocessed, it won't need to be changed again.

I implemented the embedded BST in Go and compared its lookup performance with the builtin sort.Search(), which performs binary search on a presorted array. The runs are repeated for increasing array sizes in powers of 2, as well as stride widths (both in terms of the number of machine words, not bytes). The item to be searched is incremented by the stride width on each iteration; a small stride approximates sequential lookup, whereas a large stride approximates random lookup. Each data point is the average over some 2,000,000 to 100,000,000 iterations (the number of iterations varies due to the idiosyncrasies of the Go benchmark framework, not my fault; it tries to run each case for about 5 seconds).

One caveat is that sequential lookup (small stride widths) may not cover a significant portion of a very large array of size \(2^n\) where \(n \ge 20\), but the experiment is fair because (1) both algorithms are subject to the same conditions, and (2) array sizes above \(2^{20}\) already exceed the last-level cache on the machine running the experiment, which is sufficient to demonstrate the performance difference due to cache locality. The performance numbers were collected on a late 2013 MacBook Pro 13" whose Core i5 "Haswell" I5-4258U CPU runs at 2.4 GHz, with 32KB of L1 data cache per core and 3MB of shared last-level cache. The machine has 8GB of DDR3 RAM. The program is compiled as a 64-bit binary.

[Figure: average lookup time of binary search vs. embedded BST across array sizes and stride widths]

You can use the drop-down menus to highlight the series for a specific algorithm (binary search, embedded BST) to see how they perform at various stride widths, or to highlight specific stride widths to compare how the two algorithms perform.

Here are some observations.
  • At small array sizes \(2^n\) where \(n \le 7\), all lookups fit in the L1 cache. The two algorithms have identical performance at all stride widths.
  • From array sizes \(2^7\) to \(2^{12}\), L1 misses start to happen. An array of size \(2^{12}\) occupies 32KiB, the size of the L1 cache.
  • Array sizes \(2^{12}\) through \(2^{18}\) still fit within the last-level cache, after which performance starts to skyrocket at the largest stride width (which simulates random access). Even so, the embedded BST holds up much better than binary search.
  • At small strides, embedded BST has a more predictable performance than binary search across all array sizes.
  • Embedded BST performs at least as well as or better than binary search across all array sizes and strides.
Admittedly, embedded BST is optimized for lookup. After the initial preprocessing, we don't try to insert or remove entries from the table. If a modifiable table is desired, we can relax the density of the array so that the BST only needs to be approximately balanced. This allows us to use any classical balanced BST algorithms like red-black tree or AVL tree. It may be worthwhile to investigate the cache performance of array embedded versions of these algorithms as future work.

Some IP address lookup use cases require inserting and deleting individual addresses, such as for the purpose of intrusion detection. Since they only need to match individual addresses rather than netblocks, a hash table would suffice.

Otherwise, the embedded BST is a great data structure for an immutable IP address range lookup table that is initialized once and optimized for lookup. It has the most compact representation and much improved cache locality over binary search.

Thursday, January 11, 2018

C++, the cause and solution to life's pimpl problems


Can a programming language invite programmers to write anti-patterns that make them write more code that is also harder to follow? Certainly, it can.

PIMPL is a much-hyped code pattern in C++ that separates a class's public declaration from its implementation detail by wrapping a pointer to the implementation object in the public class. This is all the user of the public class sees; the pointer is the only member of the public class.
// In xyzzy.h

class Xyzzy {
 public:
  Xyzzy();   // Constructor.
  ~Xyzzy();  // Destructor.

  void Foo();

 private:
  class Impl;  // Incomplete forward declaration to the implementation.
  Impl *pimpl_;  // Pointer to incomplete class is allowed.
};
The implementation would be written like this, hidden away from the user.
// In xyzzy.c

class Xyzzy::Impl {
 public:  // Must be visible to Xyzzy's member definitions below.
  ...
  void Foo() {
    // Actual code.
  }
  ...
  // All would-be private members of Xyzzy are declared in the Impl class.
};

Xyzzy::Xyzzy()  // Constructor.
  : pimpl_(new Impl) {}

Xyzzy::~Xyzzy() {  // Destructor.
  delete pimpl_;
}

void Xyzzy::Foo() {
  pimpl_->Foo();
}
There are some elaborate concerns like constness preservation, use of smart pointers, copying and moving, memory allocation, and generalizing the boilerplate to a template, but I want to take a step back and look at what PIMPL accomplishes, and why it is necessary.

The main reason for using PIMPL is to reduce compilation dependencies, so that changes made to the implementation class would not force the user of the public interface to have to recompile code. Recompilation is necessary when the change breaks binary compatibility. In C++, a change could break binary compatibility for banal reasons like adding a private member to a class.

When class members are added or removed, the size of the class changes. In C++, it is the caller's responsibility to prepare memory storage for the object, regardless of whether the object is stack or heap allocated. Without recompilation, the storage might be too small. Bigger storage is not a problem, but the compiler has no way of telling what the old size was. With PIMPL, the size of the public class is always a single pointer to the implementation object, so it stays the same. It shifts the responsibility of storage preparation from the user to the implementation.

Another type of change that warrants recompilation is when virtual methods are added or removed, which changes the ordering of function pointers in the vtable. Without recompilation, the old code would end up calling the wrong virtual method since the index changed. PIMPL would not be able to alleviate this type of recompilation; adding or removing non-virtual methods from the public class would still require recompilation under PIMPL.

One might ask: instead of PIMPL, why not use an abstract base class with a static constructor? It would look something like this for the public interface.
// In xyzzy.h
class Xyzzy {
 public:
  static Xyzzy *New();  // Static constructor.
  virtual ~Xyzzy();  // Virtual destructor.
  virtual void Foo() = 0;  // Pure virtual.

 protected:
  Xyzzy();  // Not to be constructed directly by user.
};
And the implementation:
// In xyzzy.c
class XyzzyImpl : public Xyzzy {
 public:
  ~XyzzyImpl() override {
    ...
  }

  void Foo() override {
    ...
  }

  // Other members.
};

Xyzzy *Xyzzy::New() {
  return new XyzzyImpl;
}
The problem is that the abstract base class forces all methods to be virtual, and virtual method calls are more expensive because branch prediction is harder with function pointers. The responsibility to manage storage is shifted to the static constructor, so adding or removing members of the base class shouldn't affect binary compatibility. Even so, it still requires recompilation in practice, so members should be declared in the implementation class only.

The common theme here is that any change to a class in C++ requires recompilation, period. That's because the compiler can't tell what changed, so it can't tell whether the change affects binary compatibility. Contrast this with C, where the user of a struct never has to know its size, without relying on virtual methods.
// In xyzzy.h

struct xyzzy;  // Incomplete type.

struct xyzzy *new_xyzzy();              // Constructor.
void delete_xyzzy(struct xyzzy *this);  // Destructor.
void xyzzy_foo(struct xyzzy *this);     // Method foo().
Although most build systems would still recompile any translation unit that includes xyzzy.h just to be safe, they don't have to if the changes are binary compatible. This is why an executable compiled against an older header can still be dynamically linked with a newer version of the shared library without recompilation, and it still works.

In the end, I think any effort to reduce recompilation for C++ code is futile. C++ is inherently designed for whole-program compilation. There are other reasons too, like how template instantiation requires the implementation source code to be available. But any class requires the user to know its size, and one has to go to great lengths to hide that from the user in order to ensure binary compatibility.

Clearly, PIMPL is an anti-pattern that is motivated by the way C++ is designed. It's a problem that C never has to worry about.