Wednesday, October 2, 2019

Python re.finditer() from an iterable of strings

Earlier today, I told a work colleague that re.finditer() in Python is usually the preferred way to tokenize input, rather than hand-rolling a lexer over an iterable of characters. He said that it requires reading the whole file into memory. On the spot, I sheepishly said that you can probably read it line by line and call re.finditer() repeatedly, as long as the token regexp never crosses a line boundary.

That is true, but not a great answer. It turns out that constructing a chained re.finditer() over an iterable of strings is possible, although not quite straightforward.
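
For context, a minimal sketch of that line-by-line workaround might look like this (finditer_by_line is a hypothetical name; it only works if no token ever spans a line boundary):
import re

def finditer_by_line(pattern, lines, flags=0):
    # Tokenize one line at a time; correct only when no token
    # can cross a line boundary.
    for line in lines:
        for m in re.finditer(pattern, line, flags):
            yield m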

The first attempt here demonstrates the main idea: since re.finditer() matches are non-overlapping, once all matches are exhausted, the next re.finditer() should resume from the unmatched remainder of the string.
import re

def finditer_chained_bad(pattern, strings, flags=0):
    last = ''
    for s in strings:
        m = None
        for m in re.finditer(pattern, last + s, flags):
            yield m
        if not m:
            last = ''  ##1
            continue
        # m is the last match object; keep only the text after it.
        # Note that m's indices refer to last + s, not s alone.
        last = (last + s)[m.end(0):]
The line with the comment ##1 may be a point of contention. If re.finditer() has not yielded any match, the implementation decision here is to discard s entirely, assuming that nothing in it would ever match, but one could argue that a potential match might cross the input string boundary. If this were changed to last += s, matches crossing the input boundary would work, at the expense of unbounded memory growth if a long span of the input never yields a match. That growth is unavoidable in that case: a potential match longer than the input chunk can only be found by accumulating chunks until the whole match fits, which amounts to dynamically increasing the chunk size.
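
To make the trade-off concrete, here is a small illustration (the pattern and chunks are hypothetical, chosen so that the only possible match straddles the chunk boundary):
# With last = '' at ##1, a match spanning the chunk boundary is lost:
# neither 'xfo' nor 'oy' contains 'foo' on its own, so both chunks are
# discarded and the cross-boundary 'foo' is never found.
for m in finditer_chained_bad(r'foo', ['xfo', 'oy']):
    print(m.group(0))  # prints nothing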

Another problem is that the last match could be a partial match that continues into the next string. If we yield it naively, the caller would get a broken partial match. To illustrate this, suppose the input is chunked into 8-character strings:
def main():
    for m in finditer_chained_bad(r'\w+', ['hello wo', 'rld']):
        print(m.group(0))

main()
It prints:
hello
wo
rld
The word "world" is broken into two matches by mistake due to the string boundary. To fix this, we just need to defer the last match and restart the next re.finditer() from the beginning of the last match. The full working code becomes:
def finditer_chained(pattern, strings, flags=0):
    last_s = ''
    last_m = None  # in case strings is empty
    for s in strings:
        last_m = None
        last_s += s
        for m in re.finditer(pattern, last_s, flags):
            # Defer each match by one: yield the previous match only
            # after the next one is found, so the final (possibly
            # partial) match is held back.
            if last_m: yield last_m
            last_m = m
        if not last_m:
            continue
        assert last_m is m
        # Restart from the beginning of the deferred match, so the
        # next chunk can extend it if it was cut off.
        last_s = last_s[m.start(0):]
    if last_m: yield last_m
And it correctly prints:
hello
world
The resulting finditer_chained() accepts a file-like object, because a file is an iterable of lines, and behaves as if re.finditer() were applied to the whole file, all without reading the whole file into memory.
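
For example, tokenizing a (hypothetical) file input.txt lazily would look like this:
with open('input.txt') as f:   # a file object is an iterable of lines
    for m in finditer_chained(r'\w+', f):
        print(m.group(0))
Between chunks, only the text from the start of the deferred match onward is retained, so memory use stays bounded as long as matches keep occurring.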