That is true, but not a great answer. It turns out that constructing a chained re.finditer() from an iterable of strings is possible although not quite straightforward.
The first attempt here demonstrates the main idea: since re.finditer() matches are non-overlapping, after all matches are exhausted, the next re.finditer() should resume from the remainder of the unmatched string.
def finditer_chained_bad(pattern, strings, flags=0): last = '' for s in strings: m = None for m in re.finditer(pattern, last + s, flags): yield m if not m: last = '' ##1 continue # m is the last match object. last = s[m.end(0):]The line with the comment
##1
may be a point of contention. If re.finditer() had not yielded any match, the implementation decision here is to discard s entirely assuming that nothing in it would ever match, but one could argue that a potential match could be crossing the input string boundary. If this were changed to last += s
, then it would work if the potential match crosses input boundary, at the expense of unbounded memory growth if a long span of the input never yielded any match. The unbounded growth is necessary because that means a potential match is larger than the input chunk size, so in order to get the full match, we need to dynamically increase the chunk size, which is what we're doing here.Another problem is that the last match could be a partial match that continues into the next string. If we yield it naively, the caller would get a broken partial match. To illustrate this, suppose the input is chunked into 8 character strings:
def main(): for m in finditer_chained_bad(r'\w+', ['hello wo', 'rld']): print m.group(0)It prints:
hello wo rldThe word "world" is broken into two matches by mistake due to the string boundary. To fix this, we just need to defer the last match and restart the next re.finditer() from the beginning of the last match. The full working code becomes:
def finditer_chained(pattern, strings, flags=0): last_s = '' for s in strings: last_m = None last_s += s for m in re.finditer(pattern, last_s, flags): if last_m: yield last_m last_m = m if not last_m: continue assert last_m is m last_s = s[m.start(0):] if last_m: yield last_mAnd it correctly prints:
hello worldThe resulting finditer_chained() accepts a file-like object because a file is an iterable of lines, as if re.finditer() is applied to the whole file. And it works without reading the whole file in memory.
No comments:
Post a Comment