Monday, June 13, 2022

Two-Liner Version String Sorting in Python

Version sorting is sometimes called natural sorting as well. There exists a natsort package that caters to the need, and it has more than 1000 lines of code. What if I tell you we can do it in two lines?

Version sorting is the problem that a version string "1.2.1" should be less than "1.10.1", but string comparison puts "1.10.1" before "1.2.1" because the prefix "1.1" < "1.2". Version strings can be arbitrary complex, e.g. "1.2.3beta1" should be less than "1.2.10beta3" or it could be punctuated like "1.2.3-rc3".

The gist of it is that if any part of the string looks like a number, then compare it like a number. Here is the code:

import re

def version_key(key: str) -> tuple:
  """Returns a sortable key in version order."""
  items = re.split(r'(\d+)', key)
  return tuple(int(item) if item.isdigit() else item for item in items)

And it can be used like this:

>>> a = ['version-1.9', 'version-2.0', 'version-1.11', 'version-1.10']
>>> sorted(a, key=version_key)
['version-1.9', 'version-1.10', 'version-1.11', 'version-2.0']

So how does it work? When re.split() is given a delimiter pattern that contains capturing groups, it also returns the capturing group matches in the result list. Effectively, we get a tokenizer that can split numbers and non-numbers into separate tokens.

>>> re.split(r'(\d+)', '1.10.2')
['', '1', '.', '10', '.', '2', '']

Then, the second line converts any string in the list that looks like a number into an integer.

>>> items = ['', '1', '.', '10', '.', '2', '']
>>> tuple(int(item) if item.isdigit() else item for item in items)
('', 1, '.', 10, '.', 2, '')

When this tuple is being used by sorted() for comparison, Python would compare the tuple item-wise. The leading empty string means the original string begins with a number. It is important to keep it so that the item types are aligned to be (str, int, str, int, ..., str) in any version key tuple. If we remove the empty strings, pairwise comparison may end up comparing a str and int, which would be an error.

As a result, we can compare versions that are formatted differently, and it would still work.

>>> a = ['version-1.9', '2.0', 'version-1.11', '1.10']
>>> sorted(a, key=version_key)
['1.10', '2.0', 'version-1.9', 'version-1.11']

Of course, natsort does a lot more. It also has options to extract a floating point number or a human readable comma-delimited number. The working principle will be very similar, just replace the pattern given to re.split() with what you want to be interpreted as a "number" and somehow convert it to a number.

But for sorting a version string, two lines of code is all you need.

Sunday, June 12, 2022

Designing a Data Breach Resilient Database

Rainbow crepe cake, by The Cooking of Joy.

With several high profile data breach settlements—such as Equifax and Morgan Stanley—coming to a conclusion, it begs the question: how do we prevent these data breaches from happening in the first place?

In the case of Equifax, attackers gained unauthorized access via their dispute portal website which allowed them to discover credential to the central database keeping the customer data, and they slowly exfiltrated that data over 76 days to avoid detection.

In the case of Morgan Stanley, the data breach was caused by improper disposal of used hard drives containing customer data.

Security is always as strong as the weakest link, so if we want to prevent future data breach, we have to understand how the storage stack works. It's helpful to think of storage as a layered model like how the OSI model (Open Systems Interconnection) helps us understand networking. We can tentatively call this the DSI model (Data Storage Interconnection). There is a straightforward analogy of the layers between these two models.

To recap, here are the OSI layers.

OSI model

7*Application LayerA website or app that the user interacts with.
6*Presentation LayerInterpretation of the data, e.g. HTML, JSON, UTF-8.
5*Session LayerDefinition of request and response semantics, e.g. HTTP, gRPC.
4Transport LayerData transmission and flow control, e.g. TCP, UDP.
3Network LayerRouting of data between networks, e.g. IPv4, IPv6.
2Datalink LayerNode identification on a local network, e.g. Ethernet MAC.
1Physical LayerModulation of bits on a physical medium, e.g. 10GBASE-T.

* Note: The original OSI model specification only specified the purpose of each layer but not concrete examples of standards. Traditionally, HTTP is considered an "application" known as the web browser, but these days HTTP is also used for a transport like gRPC, which pushes it down to a lower layer. My definition here reflects the modern web2 or web3 stacking.

DSI model
7Application LayerA website or app that the user interacts with.
6Presentation LayerInterpretation of the data, e.g. HTML, JSON, UTF-8.
5Schema LayerModeling of data, e.g. table definitions.
4Database LayerTransaction and indexing, e.g. SQL.
3Filesystem LayerAbstraction for files and directories, e.g. ZFS, NTFS, AWS S3.
2Block LayerBlock devices, e.g. SCSI, SATA.
1Physical LayerModulation of bits on a physical medium, e.g. SMR.

In a similar fashion, we can define data storage layers that are roughly analogous to the OSI layers. Application Layer and Presentation Layer are the same, but layers 5 and below are redefined for data storage purposes. Physical Layer has the same purpose again.

(As a side note, the Presentation Layer and Schema/Session Layer are typically where difficulties may arise when sharing data between different companies, for example to provide universal access to health records, and this is true both in the networking and data storage sense. But that's a topic for another day.)

As we can see, the Morgan Stanley data breach happened at the Physical Layer. The Equifax breach started at the Application Layer, but the data is exfiltrated at the Database Layer. However, data breach can happen at other layers too. Here is a summary of risk factors at each layer.

DSI attack vectors
7Application LayerA vulnerable website allowing access to lower layer credentials, e.g. Equifax.
6Presentation LayerCross-site scripting, where malicious code is injected through data.
5Schema LayerEnumeration attack due to poorly designed data access policies.
4Database LayerExfiltration of data (e.g. Equifax), SQL injection attack.
3Filesystem LayerLocal privilege escalation.
2Block LayerNetwork attached storage in an unsecured network or using an insecure protocol.
1Physical LayerTheft or improper disposal of physical medium, e.g. Morgan Stanley.

* Note: as new attack techniques become known, I'll add them to the correct layer here.

It is tempting to try to put all security burden on the Application Layer, but it is unrealistic to expect that a single layer can handle all the attacks thrown at it. Again, since security is only as strong as the weakest link, we have to fortify each layer.

Application Layer should impose a strict separation of code and data. When possible, the application should be implemented using a compiled language with code placed in a read-only filesystem (see Revival of Harvard Architecture). The application should be shielded behind a monitoring system for detecting unusual access patterns, but reconnaissance alone is not enough. It is often already too late when the exploit is detected, but we still want to have real-time signal for exploit attempts.

Presentation Layer should escape data consistently when inserting one piece of data into another. User-supplied HTML should be parsed and scrubbed to keep only allowlisted elements. A proper templating library should be able to handle that. Don't use string concatenation for structural data.

Schema Layer should use itemized encryption as a means to control the sharing of data. Data is considered shared with someone if they have access to the session key for that data. One way to do this is to encrypt the item key for anyone that should have read access to the item. The user has an encrypted item key that can be decrypted by a credential of their choice. If the backend system requires access to the item, the item key will additionally be encrypted for the backend's role account credential.

Since anyone who has access to the item key will have access to all data encrypted by that item key, data should be scoped for the reader. Data for a different set of readers should be encrypted by its own item key. Nevertheless, access to the item's cipher-text should still be subject under an access-control list. Even though itemized encryption alone should prevent the enumeration attack, the additional access-control list guards against improperly designed itemized encryption key schedule or derivation.

Database Layer should apply a page-level encryption (such as SQLite Encryption Extension) even though user data is already encrypted by the Schema Layer. This allows Filesystem Layer to replicate or backup the database without worrying that some tables or indices may still contain unencrypted sensitive data.

We do not presently require the Filesystem Layer to implement any mitigation. This allows industry-standard data availability and redundancy practices to continue to be deployed. Vulnerability at this layer comes from the OS vendor and is typically outside of our control, so we mitigate this by having encryption at the Database Layer and the Block Layer.

Block Layer or Bit Layer should apply full-disk encryption so that it is impossible to recover data without the volume encryption key if the storage medium is improperly disposed. Encryption at this level alone is not enough because in case of a stolen laptop, a copy of the decryption key would still be found in system memory somewhere. And there is a good chance that the residual key can still be recovered after the laptop is powered off, e.g. by cryogenically freezing the RAM.

In summary, mitigation strategies tailored for each data storage layer together provide a defense in depth. Reasoning using the DSI model allows us to clearly see how the separation of concerns at each layer contribute to the overall mitigation strategy. And this is how to design a data-breach resilient database.