Monday, June 13, 2022

Two-Liner Version String Sorting in Python

Version sorting is sometimes called natural sorting as well. There exists a natsort package that caters to the need, and it has more than 1000 lines of code. What if I tell you we can do it in two lines?

Version sorting is the problem that a version string "1.2.1" should be less than "1.10.1", but string comparison puts "1.10.1" before "1.2.1" because the prefix "1.1" < "1.2". Version strings can be arbitrary complex, e.g. "1.2.3beta1" should be less than "1.2.10beta3" or it could be punctuated like "1.2.3-rc3".

The gist of it is that if any part of the string looks like a number, then compare it like a number. Here is the code:

import re

def version_key(key: str) -> tuple:
  """Returns a sortable key in version order."""
  items = re.split(r'(\d+)', key)
  return tuple(int(item) if item.isdigit() else item for item in items)

And it can be used like this:

>>> a = ['version-1.9', 'version-2.0', 'version-1.11', 'version-1.10']
>>> sorted(a, key=version_key)
['version-1.9', 'version-1.10', 'version-1.11', 'version-2.0']

So how does it work? When re.split() is given a delimiter pattern that contains capturing groups, it also returns the capturing group matches in the result list. Effectively, we get a tokenizer that can split numbers and non-numbers into separate tokens.

>>> re.split(r'(\d+)', '1.10.2')
['', '1', '.', '10', '.', '2', '']

Then, the second line converts any string in the list that looks like a number into an integer.

>>> items = ['', '1', '.', '10', '.', '2', '']
>>> tuple(int(item) if item.isdigit() else item for item in items)
('', 1, '.', 10, '.', 2, '')

When this tuple is being used by sorted() for comparison, Python would compare the tuple item-wise. The leading empty string means the original string begins with a number. It is important to keep it so that the item types are aligned to be (str, int, str, int, ..., str) in any version key tuple. If we remove the empty strings, pairwise comparison may end up comparing a str and int, which would be an error.

As a result, we can compare versions that are formatted differently, and it would still work.

>>> a = ['version-1.9', '2.0', 'version-1.11', '1.10']
>>> sorted(a, key=version_key)
['1.10', '2.0', 'version-1.9', 'version-1.11']

Of course, natsort does a lot more. It also has options to extract a floating point number or a human readable comma-delimited number. The working principle will be very similar, just replace the pattern given to re.split() with what you want to be interpreted as a "number" and somehow convert it to a number.

But for sorting a version string, two lines of code is all you need.

No comments: