Tuesday, December 25, 2018

“Hybrid Log Float” 14-bit to 10-bit Pixel Transfer Function

Here I propose Hybrid Log Float, a transfer function that maps a 14-bit linear pixel to a 10-bit non-linear pixel in a way that is inspired by floating-point numbers. Unlike a 16-bit half float, we don't have enough bits for both an exponent and a mantissa. We also don't need to represent fractions or negative numbers. A custom integer bit format allows us to store more significant figures while preserving the widest dynamic range from the input. Expressing the transfer function as an integer bit format also allows the encoding and decoding between linear and non-linear colorspaces to be fast and bit-accurate without using a lookup table.

First, some background information. In the past year, I've had the pleasure of shooting some videos (here, here, and here) with the Panasonic GH5s, which is a fine camera. When it was introduced in January, it was the first compact interchangeable-lens camera that could internally record DCI 4K in 4:2:2 10-bit H.264. That recording ability was only later surpassed by the Blackmagic Design Pocket Cinema Camera 4K (BMPCC) released in September, which records 12-bit CinemaDNG RAW or 4:2:2 10-bit ProRes. These two cameras have very similar (if not the same) sensors that output 14 bits per pixel. The GH5s will generate 14-bit RW2 files when taking pictures (e.g. for time lapses), but the BMPCC won't.

The sensor's 14-bit per pixel output means that the maximum theoretical dynamic range is 14 stops. That's because each stop is twice or half the amount of light, so the number of bits per sample corresponds to the number of stops. This is corroborated by the assessment of the GH5s done by Alan Roberts in EBU Tech 3335 S29. He measured 14.6 stops for Hybrid Log Gamma and 13.5 for V-Log L. Both Hybrid Log Gamma and V-Log L are picture profiles that map the 14-bit sensor output to the 10-bit codec. The measured dynamic range exceeds the theoretical maximum somewhat, but it's within measurement error. Also, quantum uncertainty means that light levels below the sensor's sensitivity could sometimes be picked up as temporal noise. But as a rule of thumb, we can think of the number of bits as the number of stops.
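To make the rule of thumb concrete: a 14-bit sample spans linear code values \(1\) through \(2^{14}-1\), so the ratio between the brightest and the darkest non-zero value is about \(2^{14}\), which expressed in stops is \( \log_2 2^{14} = 14 \).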

It is interesting how dynamic range can be affected by the way the codec downsamples the 14-bit sensor output to 10-bit. The most naive approach is to simply truncate the least significant 4 bits, but that means the darkest 16 shades collapse into one, and you end up with only 10 stops of dynamic range. This is pretty harsh if someone wants to recover shadow detail from an underexposed picture. This is probably what the Like709 picture profile (which simulates the BT.709 color space) does on the GH5s, since it measures only 10.3 stops.

The Hybrid Log Gamma (HLG) and V-Log L picture profiles take advantage of the way human eyes perceive intensities of light non-linearly, so they downsample the 14-bit sensor output (which is a linear measurement of light) to 10-bit for the codec using transfer functions that preserve the most perceptible shades. However, the picture profiles are designed for different purposes. HLG is calibrated for consumer TVs that render HDR in the BT.2100 colorspace, which is hard to color grade (changing the look and feel of a picture by manipulating colors) because the colorspace is non-linear. V-Log L has a flatter picture profile, which makes color grading easier, but it's still better to convert it to a linear colorspace for more accurate color grading. A non-linear colorspace can be converted back to linear using a lookup table (LUT).

Then two things occurred to me.

First, cameras should really provide a 16-bit codec. If the sensor data are only 14 bits, normalize the sample by zero-padding the least significant bits. The additional bits aren't a significant storage overhead (probably negligible after either lossless or lossy compression), and it makes processing on a computer a lot more efficient since the samples are naturally aligned to 16-bit short integers, which are well suited for SIMD instructions. Unfortunately, the only 16-bit capable codec today is HEVC, and it's not widely implemented. Both ProRes RAW and CineForm are currently 12-bit, even though their coding principles (DCT for ProRes, Wavelet for CineForm) are bit-agnostic, so it's just a matter of someone implementing it. I suspect that a 16-bit codec would be much easier to implement. It would also avoid the bit-shifting voodoo needed to unpack 14-bit integers.

Second, since a 16-bit capable codec is a bit out of reach, it's beneficial to have a transfer function that is bit-accurate (to minimize rounding errors) but still takes into account the way humans perceive light intensities: we tend to notice minute light variances in darkness, but we are less sensitive to small variances when the environment is bright.

The result is the Hybrid Log Float transfer function, which encodes a linear value to a logarithmic scale and decodes back to the same linear colorspace with a predictable allocation of significant figures. The idea is to encode the exponent as the number of leading one bits in unary, and use the rest of the bits for storing significant figures, in a way that's inspired by the UTF-8 encoding. By contrast, the A-law and µ-law companding algorithms count the exponent in binary, the same way a minifloat works (with a different number of bits allocated to exponent and mantissa), rather than in unary.

To illustrate how unary exponent works, let's consider 14-bit to 12-bit reduction first.


It's broken down into four spans. The first span (a) has 11 bits of significant figures and is encoded verbatim as linear. The second span (b) has 10 bits. The last two spans (c) and (d) each have 9 bits and are effectively ½ log(x) = log(√x).
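Here is a minimal Go sketch of this 14-bit to 12-bit encoding. The function name and the exact span boundaries (2048, 4096, and 8192) are my own reconstruction from the description above, assuming each span after the linear one covers one octave; the real boundaries in the figure may differ.
// Encode14To12 maps a 14-bit linear sample (0..16383) to a 12-bit code.
// The leading bits of the code count the exponent in unary, UTF-8 style:
//   0xxxxxxxxxxx  (a) 0..2047 stored verbatim   (11 significant bits)
//   10xxxxxxxxxx  (b) 2048..4095, drop 1 LSB    (10 bits)
//   110xxxxxxxxx  (c) 4096..8191, drop 3 LSBs   (9 bits)
//   111xxxxxxxxx  (d) 8192..16383, drop 4 LSBs  (9 bits)
func Encode14To12(x uint16) uint16 {
  switch {
  case x < 2048:
    return x
  case x < 4096:
    return (0b10 << 10) | ((x - 2048) >> 1)
  case x < 8192:
    return (0b110 << 9) | ((x - 4096) >> 3)
  default:
    return (0b111 << 9) | ((x - 8192) >> 4)
  }
}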

Here is the 14-bit to 10-bit reduction using the same approach.


The spans are (a) 9-bit linear, (b) 8-bit log, (c) 7-bit 1/2 log, (d) 6-bit 1/4 log, (e) 5-bit 1/8 log, (f) 5-bit 1/8 log. Notice that the two highlight tiers only have 5 bits, or 32 shades, which may not be enough to recover highlights. Since this encoding allocates more bits to the shadows, it prefers the picture to be slightly underexposed.
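Here is a Go sketch of this first 14-bit to 10-bit variant, again assuming each span above the linear one covers one octave (so the tier boundaries are 512, 1024, 2048, 4096, and 8192); the function names are mine. The decoder restores the low end of each quantization bucket, which keeps the round trip bit-accurate: Encode14To10(Decode10To14(y)) == y for every 10-bit code y.
// Encode14To10 maps a 14-bit linear sample (0..16383) to a 10-bit code:
// a unary exponent followed by the top bits of the remaining mantissa.
//   0xxxxxxxxx  (a) 0..511 verbatim          (9 bits)
//   10xxxxxxxx  (b) 512..1023, drop 1 LSB    (8 bits)
//   110xxxxxxx  (c) 1024..2047, drop 3 LSBs  (7 bits)
//   1110xxxxxx  (d) 2048..4095, drop 5 LSBs  (6 bits)
//   11110xxxxx  (e) 4096..8191, drop 7 LSBs  (5 bits)
//   11111xxxxx  (f) 8192..16383, drop 8 LSBs (5 bits)
func Encode14To10(x uint16) uint16 {
  switch {
  case x < 512:
    return x
  case x < 1024:
    return (0b10 << 8) | ((x - 512) >> 1)
  case x < 2048:
    return (0b110 << 7) | ((x - 1024) >> 3)
  case x < 4096:
    return (0b1110 << 6) | ((x - 2048) >> 5)
  case x < 8192:
    return (0b11110 << 5) | ((x - 4096) >> 7)
  default:
    return (0b11111 << 5) | ((x - 8192) >> 8)
  }
}

// Decode10To14 is the bit-accurate inverse: it restores the low end of
// each quantization bucket of the corresponding span.
func Decode10To14(y uint16) uint16 {
  switch {
  case y < 512:
    return y
  case y < 768:
    return 512 + ((y - 512) << 1)
  case y < 896:
    return 1024 + ((y - 768) << 3)
  case y < 960:
    return 2048 + ((y - 896) << 5)
  case y < 992:
    return 4096 + ((y - 960) << 7)
  default:
    return 8192 + ((y - 992) << 8)
  }
}
Restoring the midpoint of each bucket instead of the low end would halve the average rounding error and still round-trip exactly, so that choice is mostly a matter of taste.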

Here is an alternative 14-bit to 10-bit encoding that reserves more bits for the midtones and highlights, so it has better color accuracy for highlight recovery. The trick is to pre-allocate bits for counting the exponent in binary rather than unary at first, but revert back to unary exponent afterwards.


The difference is that spans (b), (c), and (d) are now 7 bits each, or 128 shades, at 1/2 log scale, using three bits to count the exponent. The upper two tiers (e) and (f) have 6 bits each, or 64 shades, at 1/4 log(x) scale.
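For comparison, here is the same kind of Go sketch for the alternative variant, with the three-bit binary exponent prefixes 100, 101, and 110 for the midtone spans before reverting to unary. As before, the span boundaries and the name Encode14To10V2 are inferred from the shade counts rather than taken from the original figure.
// Encode14To10V2 is the alternative 14-bit to 10-bit variant.
//   0xxxxxxxxx  (a) 0..511 verbatim          (9 bits)
//   100xxxxxxx  (b) 512..1023, drop 2 LSBs   (7 bits)
//   101xxxxxxx  (c) 1024..2047, drop 3 LSBs  (7 bits)
//   110xxxxxxx  (d) 2048..4095, drop 4 LSBs  (7 bits)
//   1110xxxxxx  (e) 4096..8191, drop 6 LSBs  (6 bits)
//   1111xxxxxx  (f) 8192..16383, drop 7 LSBs (6 bits)
func Encode14To10V2(x uint16) uint16 {
  switch {
  case x < 512:
    return x
  case x < 1024:
    return (0b100 << 7) | ((x - 512) >> 2)
  case x < 2048:
    return (0b101 << 7) | ((x - 1024) >> 3)
  case x < 4096:
    return (0b110 << 7) | ((x - 2048) >> 4)
  case x < 8192:
    return (0b1110 << 6) | ((x - 4096) >> 6)
  default:
    return (0b1111 << 6) | ((x - 8192) >> 7)
  }
}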

Here is a plot comparing:
  • Two variants of Hybrid Log Float (float v1, float v2)
  • Log curve normalized to the output range (73*log2(x))
  • Gamma correction curve at γ = 2.4 normalized to the input range (x ** (1/2.4) * 18) and the output range (x ** (1/2.4) * 57).
  • Linear normalized to the input range (x / 16) and the output range (x).
In the plot, the x axis is the input range, and the y axis is the output. A steeper slope indicates better color resolution. Clipping happens when the curve goes off the chart.


It may seem that log provides too much value resolution for the shadows, while gamma 2.4 seems to provide just the right amount. But if we zoom into the input range 0-256, we can see that both log and gamma use a much wider output range than the input range, resulting in wasted bits and more difficult noise reduction, since the noise is amplified. Both Hybrid Log Float variants follow the same slope as linear x (normalized to the output) at this scale, so we don't have the noise amplification problem.


Overall, Hybrid Log Float has the following benefits:
  • It does not waste output bits in the deep shadow, so it doesn't have the noise amplification problem that log or gamma may suffer.
  • Good midtone resolution.
  • It resolves highlights similarly to log, for greater dynamic range.
  • The alternative second variant of Hybrid Log Float provides better midtone and highlight resolution than the first variant.
Most importantly, Hybrid Log Float provides bit-accurate and efficient decoding back to the linear colorspace, so color grading can be done accurately.

Last but not least, while I would like to see cameras record to Hybrid Log Float when using a legacy 10-bit codec, ultimately I would like to see native 16-bit codecs so that post-processing does not have to waste precious machine cycles realigning awkward 10-bit integers.

Wednesday, May 30, 2018

Passwordless Debian Login

I run Debian inside a VirtualBox VM and share my home folder on the host OS, and now I've decided that it's silly to have to enter my password inside the VM all the time. Mainly, I want to auto-login to the console tty, and also convince sudo to let me run commands without entering a password.

I thought this should be as straightforward as googling for a recipe. It turns out the instructions I found on the internet are just wrong and truly awful.

First, to configure auto-login on Debian running systemd the correct way, run systemctl edit getty@tty1 and enter the following:
[Service]
ExecStart=
ExecStart=-/sbin/agetty --autologin USERNAME %I $TERM
Replace USERNAME with your actual username. Here are the ways that this page got it wrong: on Debian, there is no /usr/bin/agetty; it is found under /sbin/agetty. Also, the dangling I in the ExecStart line should have been %I, which is the systemd specifier for the instance name. The baud rate should be omitted.

If something goes wrong, just switch to tty2 with Alt-F2 (or Fn-Option-F2 on a Mac) and fix the problem. If you have a touch bar, configure it to always show function keys. Otherwise, rebooting with the Debian installer CD into rescue mode always works.

Now, to convince sudo to let me run commands without a password, run visudo and make sure we swap the user and group specifications with an end result like this:
# Allow members of group sudo to execute any command
%sudo ALL=(ALL:ALL) ALL

# User privilege specification
root ALL=(ALL:ALL) ALL
myuser ALL=(ALL:ALL) NOPASSWD: ALL
The space between NOPASSWD: and ALL is optional; it works either way. What is tricky is that the statements below override the ones above. This is counter-intuitive for an access control list, which typically matches top-down. By default, the group specification appears below the user specification, so the group overrides the user, but that makes no sense. It makes more sense for the individual user setting to override the group one.

I don't have high hopes for a man page that begins with a quick guide to EBNF.

New (2020/03/19): tell systemd to use sulogin --force when the root account is locked. If a service fails to start during init, systemd falls back into rescue mode by running sulogin, but sulogin prints the following message, and systemd enters an infinite loop.
Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.

Press Enter to continue.
Unfortunately this is very common. Either a new kernel does not have the vboxsf module compiled for it (since the VirtualBox guest additions modules are not managed by DKMS), or a new VirtualBox version changed the protocol so the existing vboxsf module couldn't talk to the host.

The solution is to convince systemd to use sulogin --force. Run systemctl edit rescue.service and enter the following:
[Service]
Environment=SYSTEMD_SULOGIN_FORCE=1
This should automatically drop down to root if something goes wrong.

Thursday, April 19, 2018

My Experience with AWS CloudFront and Lambda@Edge

I've got a few classic Google Sites with custom domains, and now I want to enable TLS for them. I thought this would be easy; after all, both App Engine and Blogger now generate managed TLS certificates for custom domains. I thought classic Google Sites would be the same, but I was wrong. Not only is this not supported, but some guy over at the production support forum turned indignant toward people seeking help, like me.

The basic idea is to use a CDN to terminate TLS. There are many CDN providers, such as AWS CloudFront, Google Cloud CDN, and CloudFlare, and in theory it shouldn't matter which one you use.

AWS CloudFront worked well for my AWS S3 bucket lifecs-static.likai.org, which hosts the static assets for this blog. An S3 bucket can only be accessed over HTTPS when its name isn't a custom domain, because the wildcard certificate covers only the immediate *.s3.amazonaws.com subdomain and not deeper subdomains. Fortunately, CloudFront is able to serve from an S3 bucket directly. For other destinations, you have to specify the origin protocol. My hello.likai.org is hosted by nearlyfreespeech.net and can be accessed over HTTPS as likaior.nfshost.com. They do provide a custom script to enable Let's Encrypt, but I still decided to go with CloudFront for TLS termination.

One of the Google Sites I want to enable TLS for is cs.likai.org, a classic Google Site. The site can also be accessed over HTTPS as sites.google.com/a/likai.org/cs, but when accessed in this manner, the links generated in the page have the wrong base URI (the full sites.google.com one), which won't work when served over a custom domain. On the other hand, Google Sites refuses to serve HTTPS for custom domains, so the origin protocol has to be plain HTTP. Even so, there is another problem: the style sheet and images have HTTP links as well, so when naively served over HTTPS, the browser's mixed-content policy would block those resources and break the rendering of the site.

As I was playing around, I noticed that AWS Lambda can be used with CloudFront to rewrite requests and responses, a feature known as Lambda@Edge. This can be used to rewrite the links in the page to the preferred scheme-less form. For example, http://example.com/foo/bar would become //example.com/foo/bar, which means that when the base document is served over HTTPS, the link to example.com will also be HTTPS, but it would be HTTP for a base document served over HTTP. Of course, since you can rewrite the response body with Lambda, I could have used https://sites.google.com/a/likai.org/cs as my origin for end-to-end encryption and trimmed the base URI from the response.

Before, cs.likai.org was a CNAME for ghs.google.com. Obviously, I could not tell CloudFront to use cs.likai.org as the origin and then point cs.likai.org as a CNAME back to CloudFront, as this would create an infinite request loop, so I created a new mapping, cs-http.likai.org, in the Google Apps admin console. The plan is to serve cs.likai.org with CloudFront, which fetches cs-http.likai.org over unencrypted HTTP. It would have been easier if CloudFront let me use ghs.google.com as the origin and override the Host header, but it wouldn't let me.

I wrote the following script for the Lambda (feel free to adapt it for your own use), to be associated as CloudFront Origin Response handler.
'use strict'

const cheerio = require('cheerio')
const url = require('url')

function stripLinkProtocol(attrname, elem) {
  const $ = cheerio
  // cheerio uses htmlparser2 and domhandler internally, so elem is a
  // domhandler DOM-like object.
  const link = $(elem).attr(attrname) // $.attr() is O(n), beware!
  if (!link)  // e.g. a <form> without an action attribute
    return
  const parsed = url.parse(link)
  if (!parsed || !parsed.protocol)
    return
  const newlink = link.substr(parsed.protocol.length)
  $(elem).attr(attrname, newlink) // $.attr() is O(n), beware!
}

function stripHtmlLinks(html) {
  const $ = cheerio.load(html)
  $('[href]').each((i, elem) => { stripLinkProtocol("href", elem) })
  $('[src]').each((i, elem) => { stripLinkProtocol("src", elem) })
  $('form').each((i, elem) => { stripLinkProtocol("action", elem) })
  return $.html()
}

exports.handler = (event, context, callback) => {
  const response = event.Records[0].cf.response;
  if (response.status == 200 &&
    response.headers['content-type'] &&
    response.headers['content-type'][0].value.startsWith('text/html') &&
    response.body) {
    response.body = stripHtmlLinks(response.body)
    response.headers['x-lambda-edge'] = [
      {'key': 'X-Lambda-Edge', 'value': 'html-strip-link-protocol'}
    ]
  }
  callback(null, response)
}
But deploying it wasn't as straightforward, so here are some highlights of the surprises I encountered along the way:
  • Lambda@Edge only supports NodeJS 6.10 (as of April 19, 2018) even though AWS Lambda in general supports other runtimes.
  • The IAM role for Lambda@Edge needs additional trust relationship (but not the IAM permissions, which are for associating the Lambda with CloudFront).
  • Any external dependencies (such as cheerio) must be prepackaged and uploaded as a naked zip file (added files are not contained in a root folder). File permissions in the zip have to be world-readable, or you will get an "EACCES: permission denied, open '/var/task/index.js'" error. The extra dependencies can be fetched easily with the "npm install ..." command. The main Lambda entry point must be called index.js.
  • You have to publish a version of the Lambda before you can associate it with CloudFront. After association, this version can't be deleted until all replicas have been disassociated.
  • To test the lambda without affecting what's live, create a new CloudFront distribution specific for testing, then attach the testing Lambda version to it. Do note that CloudFront updates are very slow (can take a few hours).
I also created a unit test which helps trying out different things.
{
  "Records": [
    {
      "cf": {
        "config": {
          "distributionId": "EXAMPLE"
        },
        "response": {
          "status": "200",
          "headers": {
            "content-type": [
              {
                "value": "text/html; charset=utf-8",
                "key": "Content-Type"
              }
            ]
          },
          "statusDescription": "OK",
          "body": "<link rel='stylesheet' type='text/css' href='http://www.gstatic.com'> <a href='http://example.com/foo/bar'>"
        }
      }
    }
  ]
}
For some reason, despite the test passing and CloudFront apparently having finished deploying with no errors, I still wasn't able to see any evidence that my Lambda ran. The links were left as they were, and there was no sign of the X-Lambda-Edge response header I inserted in my code.

And then I discovered this fine print in Updating HTTP Responses in Origin-Response Triggers, buried under mountains of documentation.
When you’re working with the HTTP response, note that Lambda@Edge does not expose the HTML body that is returned by the origin server to the origin-response trigger. You can generate a static content body by setting it to the desired value, or remove the body inside the function by setting the value to be empty.

I feel that AWS Lambda@Edge is similar to App Engine, in the sense that if you wanted, you could write the whole server in Lambda, which has access to backend storage like DynamoDB and S3, and AWS takes care of replication and scaling. But the Achilles' heel of AWS is the slow replication.

I should have created an App Engine app to front my classic Google Sites instead; then I'd get HTTPS for free.

Friday, January 19, 2018

Embedding a Binary Search Tree in an Array

Recently I started researching what would be the most efficient data structure to represent an IP address lookup table. An IP address table is useful for making routing decisions or access control, which checks whether a single IP address is a member of a range of netblocks. A netblock is represented by a base address and a prefix length, e.g. 192.168.42.0/24 represents the range of IP addresses from 192.168.42.0 through 192.168.42.255. A hash table is not suitable for this purpose because hashes are computed on exact matches. With the given example, this would require entering all 256 addresses in the range into the hash table, which is a significant blowup.

Hardware routers use Ternary Content-Addressable Memory, which can tell whether a specific value matches anything in the memory after applying a bitmask. A software implementation might use a prefix tree (trie), which theoretically deduplicates common prefixes among the entries, resulting in a smaller table. However, any data structure employing pointers has the added overhead of the pointers themselves (a machine word each) and cache locality problems, which are not considered in theoretical papers.

The most compact data structure for the vast majority of lookup tables is going to be just an array of IP netblocks. For an unsorted array, lookup takes \( O(n) \) time, which will be slow for a big table. If we presort the array and use binary search, the lookup takes \( O(\lg n) \) time. But binary search is a pathological case for cache locality because lookups tend to jump back and forth until they settle on a target. Rather than binary search, which partitions the candidates into two parts, it has been shown that ternary search (three partitions) improves performance, while quaternary search (four partitions) improves little over binary search.

It had occurred to me that cache locality for binary search could be improved if we simply rearranged the presorted array into an embedding of a binary search tree (BST), such that for the node at index \( i \), its left and right children are at \( 2i + 1 \) and \( 2i + 2 \) respectively. The scheme is inspired by how a binary heap is embedded in an array, but we are embedding a complete BST. The rationale is that the search always moves left to right in the array, with the most accessed items located at the beginning of the array, which gets the most cache hits.
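As a concrete illustration (not the benchmark code itself), a membership lookup over such an embedded array could look like the Go sketch below; an IP range lookup would instead keep track of the last element that compares less than or equal to the key.
// Search reports whether x is present in bst, a presorted array rearranged
// into the implicit complete-BST layout (children of node i at 2i+1 and 2i+2).
// The walk only ever moves to larger indices, so the hottest nodes cluster at
// the front of the array where they stay cached.
func Search(bst []uint32, x uint32) bool {
  for i := 0; i < len(bst); {
    switch {
    case x == bst[i]:
      return true
    case x < bst[i]:
      i = 2*i + 1 // left child
    default:
      i = 2*i + 2 // right child
    }
  }
  return false
}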

Here is an example showing how a presorted array with the number sequence \( 0 \dots 9 \) could be embedded.


Note that this embedding works for any presorted array. The values for \(i\) are the indices of the BST embedding, and the values for \(x\) are the indices in the presorted array. A presorted array can always be represented by a complete BST, so the array is dense. However, we can't embed a prefix tree this way because a prefix tree is sparse.

The implementation strategy is to presort the array, create a new array, and copy the items from the presorted array to the new array in the order of the BST embedding. The preprocessing only needs to be done once, and it results in the most compact representation of the lookup table. The assumption is that once the table is preprocessed, it won't need to be changed again.
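Here is a minimal Go sketch of that preprocessing step: an in-order traversal of the implicit tree visits nodes in ascending order, so handing out the presorted elements during the traversal puts each one in its place. The function names are mine, not from the benchmark code; the resulting slice can then be searched with the loop from the earlier sketch.
// Embed copies a presorted slice into its complete-BST embedding.
func Embed(sorted []uint32) []uint32 {
  bst := make([]uint32, len(sorted))
  fill(sorted, bst, 0, 0)
  return bst
}

// fill assigns elements to the subtree rooted at node i by in-order traversal
// and returns the index of the next unconsumed element of sorted.
func fill(sorted, bst []uint32, i, next int) int {
  if i >= len(bst) {
    return next
  }
  next = fill(sorted, bst, 2*i+1, next) // left subtree first
  bst[i] = sorted[next]
  next++
  return fill(sorted, bst, 2*i+2, next) // then the right subtree
}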

I implemented the embedded BST in Go and compared its lookup performance with the builtin sort.Search(), which performs binary search on a presorted array. The runs are repeated for increasing array sizes in powers of 2 as well as stride widths (both in terms of the number of machine words, not bytes). The items to be searched are incremented by the stride width on each iteration; a small stride approximates sequential lookup, whereas a large stride approximates random lookup. Each data point is the average over some 2,000,000 to 100,000,000 iterations (the number of iterations varies due to the idiosyncrasies of the Go benchmark framework, not my fault; it tries to run each case for about 5 seconds).

The folly of this is that sequential lookup (smaller stride widths) may not cover a significant portion of a very large array of size \(2^n\) where \(n \ge 20\), but the experiment is fair because (1) both algorithms are subject to the same conditions, and (2) an array size above \(2^{20}\) already exceeds the last-level cache size on the machine running the experiment, which is sufficient to demonstrate the performance difference due to cache locality. The performance numbers are collected on a late 2013 MacBook Pro 13" whose Core i5 "Haswell" i5-4258U CPU runs at 2.4 GHz and has 32KB of L1 data cache and 3MB of shared last-level cache. The machine has 8GB of DDR3 RAM. The program is compiled in 64-bit mode.



You can use the drop-down menus to highlight the series for a specific algorithm (binary search, embedded BST) to see how they perform at various stride widths, or to highlight specific stride widths to compare how the two algorithms perform.

Here are some observations.
  • At small array sizes \(2^n\) where \(n \le 7\), all lookups fit in the L1 cache. The two algorithms have identical performance at all stride widths.
  • From array sizes \(2^7\) to \(2^{12}\), the L1 misses start to happen. Array size \(2^{12}\) occupies 32KiB, the size of L1 cache.
  • Array sizes \(2^{12}\) through \(2^{18}\) still fit within the last-level cache, after which the performance starts skyrocketing at the largest stride width (which simulates random access). Even so, embedded BST held up much better than binary search.
  • At small strides, embedded BST has a more predictable performance than binary search across all array sizes.
  • Embedded BST performs at least as well as or better than binary search across all array sizes and strides.
Admittedly, embedded BST is optimized for lookup. After the initial preprocessing, we don't try to insert or remove entries from the table. If a modifiable table is desired, we can relax the density of the array so that the BST only needs to be approximately balanced. This allows us to use any classical balanced BST algorithms like red-black tree or AVL tree. It may be worthwhile to investigate the cache performance of array embedded versions of these algorithms as future work.

Some IP address lookup use cases require inserting and deleting individual addresses, such as for the purpose of intrusion detection. Since they only need to match individual addresses rather than netblocks, a hash table would suffice.

Otherwise, embedded BST is a great data structure for immutable IP address range lookup tables that are initialized once and optimized for lookup. It has the most compact representation and much improved cache locality over binary search.

Thursday, January 11, 2018

C++, the cause and solution to life's pimpl problems


Can a programming language invite programmers to write anti-patterns that make them write more code that is also harder to follow? Certainly, it can.

PIMPL is a much hyped code pattern in C++ to separate a class's public declaration from the implementation detail by wrapping a pointer to the implementation object in the public class. This is what the user of the public class sees. The pointer is the only member of the public class.
// In xyzzy.h

class Xyzzy {
 public:
  Xyzzy();   // Constructor.
  ~Xyzzy();  // Destructor.

  void Foo();

 private:
  class Impl;  // Incomplete forward declaration to the implementation.
  Impl *pimpl_;  // Pointer to incomplete class is allowed.
};
The implementation would be written like this, hidden away from the user.
// In xyzzy.c

class Xyzzy::Impl {
 public:  // Public within Impl, but still hidden from users of Xyzzy.
  void Foo() {
    // Actual code.
  }
  ...
  // All private members are declared in the Impl class.
};

Xyzzy::Xyzzy()  // Constructor.
  : pimpl_(new Impl) {}

Xyzzy::~Xyzzy() {  // Destructor.
  delete pimpl_;
}

void Xyzzy::Foo() {
  pimpl_->Foo();
}
There are some elaborate concerns like constness preservation, use of smart pointers, copying and moving, memory allocation, and generalizing the boilerplate to a template, but I want to take a step back and look at what PIMPL accomplishes, and why it is necessary.

The main reason for using PIMPL is to reduce compilation dependencies, so that changes made to the implementation class would not force the user of the public interface to have to recompile code. Recompilation is necessary when the change breaks binary compatibility. In C++, a change could break binary compatibility for banal reasons like adding a private member to a class.

When class members are added or removed, the size of the class would change. In C++, it is the caller's responsibility to prepare memory storage for the object, regardless whether the object is stack or heap allocated. Without recompilation, the storage might be too small. Bigger storage is not a problem, but the compiler has no way of telling the old size. With PIMPL, the size of the public class is always just a single pointer to the implementation object, so it stays the same. It shifts the responsibility of storage preparation from the user to the implementation.

Another type of change that warrants recompilation is when virtual methods are added or removed, which changes the ordering of function pointers in the vtable. Without recompilation, the old code would end up calling the wrong virtual method since the index changed. PIMPL would not be able to alleviate this type of recompilation; adding or removing virtual methods from the public class would still require recompilation under PIMPL.

One might ask, instead of PIMPL, why not use the abstract base class with a static constructor? It would look something like this for the public interface.
// In xyzzy.h
class Xyzzy {
 public:
  static Xyzzy *New();  // Static constructor.
  virtual ~Xyzzy();  // Virtual destructor.
  virtual void Foo();

 protected:
  Xyzzy();  // Not to be constructed directly by user.
};
And the implementation:
// In xyzzy.c
class XyzzyImpl : public Xyzzy {
 public:
  ~XyzzyImpl() override {
    ...
  }

  void Foo() override {
    ...
  }

  // Other members.
};

Xyzzy *Xyzzy::New() {
  return new XyzzyImpl;
}
The problem is that the abstract base class forces all methods to be virtual, and virtual method calls are more expensive because it's harder to do branch prediction with function pointers. The responsibility to manage storage is shifted to the static constructor, so adding or removing members of the base class shouldn't affect binary compatibility. Even so, it still requires recompilation in practice, so members should be declared in the implementation class only.

The common theme here is that any change to a class in C++ requires recompilation, period. That's because the compiler can't tell what changed, so it can't tell whether the change may or may not affect binary compatibility. Contrast this with C, where the user of a struct never has to know its size, without relying on virtual methods.
// In xyzzy.h

struct xyzzy;  // Incomplete type.

struct xyzzy *new_xyzzy();              // Constructor.
void delete_xyzzy(struct xyzzy *this);  // Destructor.
void xyzzy_foo(struct xyzzy *this);     // Method foo().
Although most build systems would still recompile the translation unit that includes xyzzy.h for posterity, it doesn't have to if the changes are binary compatible. This is why an executable compiled with an older header could still be dynamically linked with a newer version of the shared library without recompilation, and it would still work.

In the end, I think any effort to reduce recompilation for C++ code is futile. C++ is inherently designed for whole-program compilation. There are other reasons, like how template instantiation requires the implementation source code to be available. But any class requires its user to know its size, and one has to go to great lengths to hide that from the user in order to ensure binary compatibility.

Clearly, PIMPL is an anti-pattern that is motivated by the way C++ is designed. It's a problem that C never has to worry about.