Large files and max_span_lines
file_re is designed around a single dial for memory usage:
max_span_lines. The rest of the API is deliberately identical to
re, so the only choice you need to make is how wide a match can
reasonably span.
The three modes
max_span_lines=None (default)
    The whole file is read into memory and scanned as one string. This is the full re-equivalent mode: a pattern can match anywhere, including across the entire file. Memory cost is proportional to the file size. Use this for files that comfortably fit in RAM.
max_span_lines=1
    The file is streamed line by line. A match cannot cross a \n. Memory cost is O(longest line). This is the fastest path for big files when the pattern is single-line.
max_span_lines=N (N > 1)
    The file is streamed through a sliding N-line window. A match may span at most N lines. Memory cost is roughly O(N x average line length). This is the recommended path for multi-GB log files where individual events span a known, bounded number of lines.
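To make the windowed mode concrete, here is a minimal pure-Python sketch of an N-line sliding window built on the standard re module. It illustrates the idea only and is not file_re's actual implementation: the helper name windowed_finditer is hypothetical, and the rule of reporting only matches that begin in the first line of the window is a simplifying assumption.

```python
import re
from collections import deque
from itertools import islice
from typing import Iterable, Iterator, Tuple


def windowed_finditer(
    pattern: str, lines: Iterable[str], max_span_lines: int
) -> Iterator[Tuple[int, str]]:
    """Yield (absolute_offset, matched_text) for non-overlapping matches.

    At most max_span_lines lines are held in memory at any time, so a
    match can never cover more than that many lines.
    """
    regex = re.compile(pattern)
    source = iter(lines)
    window = deque(islice(source, max_span_lines))
    window_start = 0  # absolute offset of the first character in the window
    resume_at = 0     # absolute offset where the next match may begin

    while window:
        text = "".join(window)
        first_len = len(window[0])
        pos = max(0, resume_at - window_start)
        # Report only matches that begin in the first line of the window;
        # later lines get their turn once the window has slid forward.
        while pos < first_len:
            m = regex.search(text, pos)
            if m is None or m.start() >= first_len:
                break
            yield window_start + m.start(), m.group()
            resume_at = window_start + m.end()
            # Step past empty matches so the loop always makes progress.
            pos = m.end() if m.end() > m.start() else m.end() + 1
        # Slide the window forward by one line.
        window_start += first_len
        window.popleft()
        nxt = next(source, None)
        if nxt is not None:
            window.append(nxt)
```

Passing an open file object as lines keeps the memory footprint at roughly N lines, because file objects yield one line at a time.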
Memory trade-offs
Consider a 50 GB log where events are typically 3-5 lines long.
- None would require ~50 GB of RAM. Not viable.
- 1 is cheap but will miss any event whose record crosses a newline.
- 5 keeps memory flat at a few kilobytes and captures the multi-line events correctly.
The choice is entirely about the pattern, not the file:
from file_re import file_re

# Single-line pattern: go as tight as possible.
for m in file_re.finditer(r"\bERROR\b", "huge.log", max_span_lines=1):
    handle(m)

# Multi-line event that is known to span at most 5 lines.
pattern = r"BEGIN TXN\n(?:.*\n){0,3}END TXN"
for m in file_re.finditer(pattern, "huge.log", max_span_lines=5):
    handle(m)
Behavior when a match would exceed the window
In windowed mode, the scanner only sees N consecutive lines at a
time. A match that would genuinely span more than N lines is simply
never visible to the regex engine and will not be reported. If you
suspect events can span more lines than you expect, measure the worst
case before sizing the window.
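One way to measure the worst case is a single streaming pass that counts how many lines each event covers. The sketch below assumes events are delimited by lines containing the literal markers BEGIN TXN and END TXN, as in the example pattern; the function name max_event_span and its parameters are hypothetical and not part of file_re.

```python
from typing import Iterable, Optional


def max_event_span(
    lines: Iterable[str], begin: str = "BEGIN TXN", end: str = "END TXN"
) -> int:
    """Return the largest number of lines any begin..end event covers.

    Returns 0 when no complete event is found.
    """
    worst = 0
    start_line: Optional[int] = None
    for lineno, line in enumerate(lines, start=1):
        if begin in line:
            # A new begin marker restarts the count for the current event.
            start_line = lineno
        elif end in line and start_line is not None:
            worst = max(worst, lineno - start_line + 1)
            start_line = None
    return worst
```

Called with an open file object, the pass reads one line at a time, so it is itself cheap to run on a very large log. The result is a lower bound for a max_span_lines value that will not silently drop events.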
Search returns the first match
In windowed mode, search() returns the
first match the scanner encounters as the window slides forward. This
matches the semantics of re.search() against the full concatenated
text. If you need the longest match, iterate and pick it yourself:
from file_re import file_re

longest = max(
    file_re.finditer(r"(hi\n)+", "logs.txt", max_span_lines=10),
    key=lambda m: m.end() - m.start(),
    default=None,
)
Note
This is a breaking change from file_re 1.x, which attempted to
return a “longer” match by continuing to scan after the first hit.
See Migrating from 1.x to 2.0 for details.
Multiprocessing over huge files
For inputs in the 10 GB-100 GB range, the practical pattern is to split
the input into shards (ideally one file per shard) and fan out across a
pool of processes. Each worker holds a bounded window (for example
max_span_lines=5), so total resident memory depends on the worker count
and window size, not on how large the files are.
import os
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from file_re import file_re

PATTERN = r"BEGIN TXN\n(?:.*\n){0,3}END TXN"


def count_matches(shard: Path) -> int:
    total = 0
    for _ in file_re.finditer(PATTERN, shard, max_span_lines=5):
        total += 1
    return total


def run(shards: list[Path]) -> int:
    with ProcessPoolExecutor(max_workers=16) as pool:
        return sum(pool.map(count_matches, shards))


if __name__ == "__main__":
    shards = sorted(Path("/var/log/app").glob("*.log.gz"))
    print(run(shards))
A few notes on this pattern:
- Shard at natural file boundaries (per-hour logs, per-host logs, rotated files). Splitting a single file mid-stream requires custom code to avoid tearing events across shard boundaries.
- file_re releases the GIL while reading and matching, so a thread pool also works for IO-heavy workloads. A process pool is still the right choice when the bottleneck is the regex itself.
- Wheels are built with the Rust regex crate’s standard tuning, so DFA-friendly patterns (no backreferences, no lookarounds) get the best throughput.
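Since either pool type can be appropriate, it can help to keep the executor swappable. The following is a minimal sketch using only concurrent.futures; the helper name fan_out and its parameters are hypothetical, and count_fn stands for any per-shard counting function such as count_matches.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Iterable, Type


def fan_out(
    count_fn: Callable[[str], int],
    shards: Iterable[str],
    executor_cls: Type[Executor] = ThreadPoolExecutor,
    max_workers: int = 8,
) -> int:
    """Sum count_fn(shard) over all shards using the given executor type."""
    with executor_cls(max_workers=max_workers) as pool:
        return sum(pool.map(count_fn, shards))
```

Pass executor_cls=ProcessPoolExecutor when the regex is the bottleneck. In that case count_fn must be a module-level function so it can be pickled and sent to the worker processes.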
Compressed files
All three modes work transparently on .gz and .xz files. The
decompression cost is paid per-worker, so with the multiprocessing
pattern above, compressed logs scale roughly linearly with worker
count.
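For readers who want to pre-process or inspect compressed shards outside file_re, the same kind of transparent handling can be approximated with the standard library. This sketch assumes the format is chosen by file extension, which the text above does not confirm for file_re itself; open_text is a hypothetical helper name.

```python
import gzip
import lzma
from pathlib import Path
from typing import IO, Union


def open_text(path: Union[str, Path], encoding: str = "utf-8") -> IO[str]:
    """Open a plain, .gz, or .xz file as a lazily decoded text stream."""
    suffix = Path(path).suffix
    if suffix == ".gz":
        return gzip.open(path, "rt", encoding=encoding)
    if suffix == ".xz":
        return lzma.open(path, "rt", encoding=encoding)
    return open(path, "r", encoding=encoding)
```

The returned object yields decompressed lines one at a time, so it can be fed to any line-oriented pass without loading the whole file.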