Monday, June 13, 2022

Two-Liner Version String Sorting in Python

Version sorting is sometimes called natural sorting as well. There exists a natsort package that caters to the need, and it has more than 1000 lines of code. What if I tell you we can do it in two lines?

Version sorting is the problem that a version string "1.2.1" should be less than "1.10.1", but string comparison puts "1.10.1" before "1.2.1" because the prefix "1.1" < "1.2". Version strings can be arbitrary complex, e.g. "1.2.3beta1" should be less than "1.2.10beta3" or it could be punctuated like "1.2.3-rc3".

The gist of it is that if any part of the string looks like a number, then compare it like a number. Here is the code:

import re

def version_key(key: str) -> tuple:
  """Returns a sortable key in version order."""
  items = re.split(r'(\d+)', key)
  return tuple(int(item) if item.isdigit() else item for item in items)

And it can be used like this:

>>> a = ['version-1.9', 'version-2.0', 'version-1.11', 'version-1.10']
>>> sorted(a, key=version_key)
['version-1.9', 'version-1.10', 'version-1.11', 'version-2.0']

So how does it work? When re.split() is given a delimiter pattern that contains capturing groups, it also returns the capturing group matches in the result list. Effectively, we get a tokenizer that can split numbers and non-numbers into separate tokens.

>>> re.split(r'(\d+)', '1.10.2')
['', '1', '.', '10', '.', '2', '']

Then, the second line converts any string in the list that looks like a number into an integer.

>>> items = ['', '1', '.', '10', '.', '2', '']
>>> tuple(int(item) if item.isdigit() else item for item in items)
('', 1, '.', 10, '.', 2, '')

When this tuple is being used by sorted() for comparison, Python would compare the tuple item-wise. The leading empty string means the original string begins with a number. It is important to keep it so that the item types are aligned to be (str, int, str, int, ..., str) in any version key tuple. If we remove the empty strings, pairwise comparison may end up comparing a str and int, which would be an error.

As a result, we can compare versions that are formatted differently, and it would still work.

>>> a = ['version-1.9', '2.0', 'version-1.11', '1.10']
>>> sorted(a, key=version_key)
['1.10', '2.0', 'version-1.9', 'version-1.11']

Of course, natsort does a lot more. It also has options to extract a floating point number or a human readable comma-delimited number. The working principle will be very similar, just replace the pattern given to re.split() with what you want to be interpreted as a "number" and somehow convert it to a number.

But for sorting a version string, two lines of code is all you need.

Sunday, June 12, 2022

Designing a Data Breach Resilient Database

Rainbow crepe cake, by The Cooking of Joy.

With several high profile data breach settlements—such as Equifax and Morgan Stanley—coming to a conclusion, it begs the question: how do we prevent these data breaches from happening in the first place?

In the case of Equifax, attackers gained unauthorized access via their dispute portal website which allowed them to discover credential to the central database keeping the customer data, and they slowly exfiltrated that data over 76 days to avoid detection.

In the case of Morgan Stanley, the data breach was caused by improper disposal of used hard drives containing customer data.

Security is always as strong as the weakest link, so if we want to prevent future data breach, we have to understand how the storage stack works. It's helpful to think of storage as a layered model like how the OSI model (Open Systems Interconnection) helps us understand networking. We can tentatively call this the DSI model (Data Storage Interconnection). There is a straightforward analogy of the layers between these two models.

To recap, here are the OSI layers.

OSI model

7*Application LayerA website or app that the user interacts with.
6*Presentation LayerInterpretation of the data, e.g. HTML, JSON, UTF-8.
5*Session LayerDefinition of request and response semantics, e.g. HTTP, gRPC.
4Transport LayerData transmission and flow control, e.g. TCP, UDP.
3Network LayerRouting of data between networks, e.g. IPv4, IPv6.
2Datalink LayerNode identification on a local network, e.g. Ethernet MAC.
1Physical LayerModulation of bits on a physical medium, e.g. 10GBASE-T.

* Note: The original OSI model specification only specified the purpose of each layer but not concrete examples of standards. Traditionally, HTTP is considered an "application" known as the web browser, but these days HTTP is also used for a transport like gRPC, which pushes it down to a lower layer. My definition here reflects the modern web2 or web3 stacking.

DSI model
7Application LayerA website or app that the user interacts with.
6Presentation LayerInterpretation of the data, e.g. HTML, JSON, UTF-8.
5Schema LayerModeling of data, e.g. table definitions.
4Database LayerTransaction and indexing, e.g. SQL.
3Filesystem LayerAbstraction for files and directories, e.g. ZFS, NTFS, AWS S3.
2Block LayerBlock devices, e.g. SCSI, SATA.
1Physical LayerModulation of bits on a physical medium, e.g. SMR.

In a similar fashion, we can define data storage layers that are roughly analogous to the OSI layers. Application Layer and Presentation Layer are the same, but layers 5 and below are redefined for data storage purposes. Physical Layer has the same purpose again.

(As a side note, the Presentation Layer and Schema/Session Layer are typically where difficulties may arise when sharing data between different companies, for example to provide universal access to health records, and this is true both in the networking and data storage sense. But that's a topic for another day.)

As we can see, the Morgan Stanley data breach happened at the Physical Layer. The Equifax breach started at the Application Layer, but the data is exfiltrated at the Database Layer. However, data breach can happen at other layers too. Here is a summary of risk factors at each layer.

DSI attack vectors
7Application LayerA vulnerable website allowing access to lower layer credentials, e.g. Equifax.
6Presentation LayerCross-site scripting, where malicious code is injected through data.
5Schema LayerEnumeration attack due to poorly designed data access policies.
4Database LayerExfiltration of data (e.g. Equifax), SQL injection attack.
3Filesystem LayerLocal privilege escalation.
2Block LayerNetwork attached storage in an unsecured network or using an insecure protocol.
1Physical LayerTheft or improper disposal of physical medium, e.g. Morgan Stanley.

* Note: as new attack techniques become known, I'll add them to the correct layer here.

It is tempting to try to put all security burden on the Application Layer, but it is unrealistic to expect that a single layer can handle all the attacks thrown at it. Again, since security is only as strong as the weakest link, we have to fortify each layer.

Application Layer should impose a strict separation of code and data. When possible, the application should be implemented using a compiled language with code placed in a read-only filesystem (see Revival of Harvard Architecture). The application should be shielded behind a monitoring system for detecting unusual access patterns, but reconnaissance alone is not enough. It is often already too late when the exploit is detected, but we still want to have real-time signal for exploit attempts.

Presentation Layer should escape data consistently when inserting one piece of data into another. User-supplied HTML should be parsed and scrubbed to keep only allowlisted elements. A proper templating library should be able to handle that. Don't use string concatenation for structural data.

Schema Layer should use itemized encryption as a means to control the sharing of data. Data is considered shared with someone if they have access to the session key for that data. One way to do this is to encrypt the item key for anyone that should have read access to the item. The user has an encrypted item key that can be decrypted by a credential of their choice. If the backend system requires access to the item, the item key will additionally be encrypted for the backend's role account credential.

Since anyone who has access to the item key will have access to all data encrypted by that item key, data should be scoped for the reader. Data for a different set of readers should be encrypted by its own item key. Nevertheless, access to the item's cipher-text should still be subject under an access-control list. Even though itemized encryption alone should prevent the enumeration attack, the additional access-control list guards against improperly designed itemized encryption key schedule or derivation.

Database Layer should apply a page-level encryption (such as SQLite Encryption Extension) even though user data is already encrypted by the Schema Layer. This allows Filesystem Layer to replicate or backup the database without worrying that some tables or indices may still contain unencrypted sensitive data.

We do not presently require the Filesystem Layer to implement any mitigation. This allows industry-standard data availability and redundancy practices to continue to be deployed. Vulnerability at this layer comes from the OS vendor and is typically outside of our control, so we mitigate this by having encryption at the Database Layer and the Block Layer.

Block Layer or Bit Layer should apply full-disk encryption so that it is impossible to recover data without the volume encryption key if the storage medium is improperly disposed. Encryption at this level alone is not enough because in case of a stolen laptop, a copy of the decryption key would still be found in system memory somewhere. And there is a good chance that the residual key can still be recovered after the laptop is powered off, e.g. by cryogenically freezing the RAM.

In summary, mitigation strategies tailored for each data storage layer together provide a defense in depth. Reasoning using the DSI model allows us to clearly see how the separation of concerns at each layer contribute to the overall mitigation strategy. And this is how to design a data-breach resilient database.

Monday, April 11, 2022

The Correct Way to Lock a Get-Or-Create Map

A Get-Or-Create map is useful for the "Multiton" (mutiple-singleton) design pattern. Locking is used to make it thread-safe. It may not seem obvious, but the straightforward way to lock a map with Get-Or-Create semantics is problematic.

// Store keeps track of all the objects created so they can be shared.
type Store struct {
  mu sync.Mutex
  m map[string]*Object

func (s *Store) GetOrCreate(key string) (*Object, error) {

  if s.m == nil {
    s.m = make(map[string]*Object)

  if obj, ok := s.m[key]; ok {
    return obj, nil
  obj, err := newObject(key)  // BAD!
  if err != nil {
    return nil, err
  s.m[key] = obj
  return obj, nil

To illustrate the problem, let's say a store already has "bar" but not "foo" saved in the map. Thread-A calls s.GetOrCreate("foo") which necessitates creating a new object for "foo". Meanwhile, Thread-B comes along and simply wants to call s.GetOrCreate("bar") which already exists. Thread-B should be able to just get "bar" without waiting for Thread-A to create "foo", but the lock held by Thread-A during newObject("foo") forces Thread-B to wait.

This is particularly problematic when the object creation takes on the order of seconds which is typical if the object represents a remote network resource, and that is a common use case for Get-Or-Create maps.

Generally speaking, holding a lock while doing blocking operations will severely limit the scalability of a service, so any lock should be confined to the scope of reading or writing an in-memory data structure.

However, Get-Or-Create maps does have to wait in some cases. Even so, we can separate the "fast path" from the "slow path" and only have to wait in the slow path. The vast majority of accesses should fall under the fast path which does not wait for blocking operations.

Consider another Thread-C which also wants to get "foo" while Thread-A is creating it. Thread-C should wait for Thread-A. If Thread-C goes ahead and creates another instance of "foo", then both Thread-A and Thread-C will end up with an object for "foo", and we now have a problem: which "foo" do we put into the map now? For this reason, the map needs a third state to indicate that an object is being created although not ready.

The right mechanism is to use a condition variable in conjunction with a mutex, so that Thread-A could release the lock while creating the new object, Thread-B could retrieve an existing object without blocking, and Thread-C could wait for object creation by Thread-A.

type Store struct {
  mu sync.Mutex
  c sync.Cond
  m map[string]*Object

func (s *Store) GetOrCreate(key string) (*Object, error) {

  if s.m == nil {
    s.m = make(map[string]*Object)

  obj, ok := s.m[key]
  for ok && obj == nil {  // Object exists but it is still being created.
    s.c.Wait()            // Wait for its creation.
    obj, ok = s.m[key]    // And try again.
  if ok && obj != nil {
    return obj

  s.m[key] = nil  // Assume responsibility for creating the object.
  obj, err := newObject(key)  // Avoid holding the lock for lengthy operations.

  if err != nil {
    delete(s.m, key)  // Object is no longer being created.  Next thread will retry.
  } else {
    s.m[key] = obj    // Creation successful.  Next thread will pick this up.
  s.c.Broadcast()     // Wake up everyone waiting one-by-one.
  return obj, err

When Thread-A is about to create a new object, it puts a nil value into the map to indicate that an object is being created but not available. Thread-C will lock the mutex and see that the value is nil, and will wait on the conditional variable (which releases the mutex while waiting). When Thread-A finishes object creation, it locks the mutex again, puts the object into the map, broadcasts a signal through the condition variable, and unlocks the mutex. This will wake up Thread-C to try again. Thread-B is able to obtain "bar" without blocking because the mutex is not held by either Thread-A or Thread-C.

However, if Thread-A is unable to create the object for any reason, it will clear the pending status, still wake up all the waiting threads, and return the error. Thread-C will see that the nil value has been removed from the map and then assume the responsibility for attempting to create a new object.

Monday, March 28, 2022

Self-Driving Car Challenges

Mercedes recently announced level 3 self-driving that takes legal liability. Typically, the levels of autonomy are defined by situations where the car is able to operate without human intervention, but being able to assume legal liability is a huge step forward. Levels of autonomy would only remain a marketing slogan unless it can be tested in court. The cost of liability will be factored into the insurance premium, so self-driving only becomes economically viable if the insurance premium can be lower than that of a human driver.

However, Mercedes' legal responsibility comes with limits: on certain (already-mapped) highways, below 40 mph, during daytime, in reasonably clear weather, without overhead obstructions.

The first challenge that any camera based system must overcome is the stability of the footage. You can see from this test that a car mounted camera with improper image stabilization (circa 2016) would produce a wobbly and shaky image. The wobble can be counteracted by shortening exposure time (high camera shutter speed) but this requires good lighting condition, hence the reasonably clear weather requirement. Furthermore, when the car is traveling fast, you need a telephoto lens to look further ahead for hazardous conditions, but longer telephoto also exacerbates the wobble, hence the operating speed limit of the self-driving. If the video footage is bad, it won't help much if you feed this into machine learning because "garbage in, garbage out." More recent cameras such as a GoPro has improved image stabilization (circa 2021) that also works on a toy car (circa 2020) which is more challenging to stabilize. These cameras can produce clean image under more forgiving lighting conditions.

Car manufacturers who are serious about their self-driving should be licensing camera stabilization technology from the likes of GoPro.

Self-driving cars using LIDAR face a different challenge. LIDAR works by sending a short pulse of light and observe reflections, so it is not dependent on external lighting conditions. But when there are multiple LIDAR equipped cars on the road, they could be picking up each other's signals which might seem like noises. As such, a LIDAR has to encode its own unique ID into the light pulse and filter out any pulse not coming from itself.

A third challenge is about how to legally dispute a claim in court. A self-driving system must be able to produce a rationale why it made any given decision, and the decision has to be legally justifiable. Previously, machine learning is a black box that could produce surprising results (it recognized a dumbbell only because it has an arm attached to it), but explainable AI is making some progress. Similarly, self-driving technology must be explainable.

Explainable self-driving technology can be thought of as a driving instructor that happens to be a computer. The driving instructor not only has to make the right decision on the road, but it also has to explain to a student why it is the right decision given the circumstance.

Car manufacturers that want to assume legal responsibility should aim for a computerized driving instructor instead.

A musician would understand that in order to perform at 100%, they must practice up to 120%. This mindset applies to a lot of engineering problems where safety is at stake. Building structures such as elevators are often designed with load considerations at ~200% rated capacity or more (ASCE/SEI 7-10 (b)). When it comes to self-driving, the specs are less quantifiable, but the design qualifications must similarly exceed expectation in order for the technology to be viable.