Large files and ``max_span_lines``
==================================

``file_re`` is designed around a single dial for memory usage:
``max_span_lines``. The rest of the API is deliberately identical to
:mod:`re`, so the only choice you need to make is how wide a match can
reasonably span.

The three modes
---------------

``max_span_lines=None`` (default)
    The whole file is read into memory and scanned as one string. This is
    the full :mod:`re`-equivalent mode: a pattern can match anywhere,
    including across the entire file. Memory cost is proportional to the
    file size. Use this for files that comfortably fit in RAM.

``max_span_lines=1``
    The file is streamed line by line. A match cannot cross a ``\n``.
    Memory cost is O(longest line). This is the fastest path for big files
    when the pattern is single-line.

``max_span_lines=N`` (``N > 1``)
    The file is streamed through a sliding ``N``-line window. A match may
    span at most ``N`` lines. Memory cost is roughly O(N x average line
    length). This is the recommended path for multi-GB log files where
    individual events span a known, bounded number of lines.

Memory trade-offs
-----------------

Consider a 50 GB log where events are typically 3-5 lines long.

- ``None`` would require ~50 GB of RAM. Not viable.
- ``1`` is cheap but will miss any event whose record crosses a newline.
- ``5`` keeps memory flat at a few kilobytes and captures the multi-line
  events correctly.

The choice is entirely about the pattern, not the file:

.. code-block:: python

    from file_re import file_re

    # Single-line pattern: go as tight as possible.
    for m in file_re.finditer(r"\bERROR\b", "huge.log", max_span_lines=1):
        handle(m)

    # Multi-line event that is known to span at most 5 lines.
    pattern = r"BEGIN TXN\n(?:.*\n){0,3}END TXN"
    for m in file_re.finditer(pattern, "huge.log", max_span_lines=5):
        handle(m)

Behavior when a match would exceed the window
---------------------------------------------

In windowed mode, the scanner only sees ``N`` consecutive lines at a time.
A match that would genuinely span more than ``N`` lines is simply never
visible to the regex engine and will not be reported. If you suspect events
can span more lines than you expect, measure the worst case before sizing
the window.

Search returns the first match
------------------------------

In windowed mode, :meth:`~file_re.core.file_re_cls.search` returns the
first match the scanner encounters as the window slides forward. This
matches the semantics of :func:`re.search` against the full concatenated
text. If you need the longest match, iterate and pick it yourself:

.. code-block:: python

    from file_re import file_re

    longest = max(
        file_re.finditer(r"(hi\n)+", "logs.txt", max_span_lines=10),
        key=lambda m: m.end() - m.start(),
        default=None,
    )

.. note::

    This is a breaking change from ``file_re`` 1.x, which attempted to
    return a "longer" match by continuing to scan after the first hit. See
    :doc:`migration_1_to_2` for details.

Multiprocessing over huge files
-------------------------------

For files in the 10 GB-100 GB range, the practical pattern is to split the
file into ranges and fan out across a pool of processes. Each worker holds
a bounded window (for example ``max_span_lines=5``) so total resident
memory stays flat no matter how large the file.

.. code-block:: python

    import os
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    from file_re import file_re

    PATTERN = r"BEGIN TXN\n(?:.*\n){0,3}END TXN"


    def count_matches(shard: Path) -> int:
        total = 0
        for _ in file_re.finditer(PATTERN, shard, max_span_lines=5):
            total += 1
        return total


    def run(shards: list[Path]) -> int:
        with ProcessPoolExecutor(max_workers=16) as pool:
            return sum(pool.map(count_matches, shards))


    if __name__ == "__main__":
        shards = sorted(Path("/var/log/app").glob("*.log.gz"))
        print(run(shards))

A few notes on this pattern:

- Shard at natural file boundaries (per-hour logs, per-host logs, rotated
  files). Splitting a single file mid-stream requires custom code to avoid
  tearing events across shard boundaries.
- ``file_re`` releases the GIL while reading and matching, so a thread pool
  also works for IO-heavy workloads. A process pool is still the right
  choice when the bottleneck is the regex itself.
- Wheels are built with the Rust ``regex`` crate's standard tuning, so
  DFA-friendly patterns (no backreferences, no lookarounds) get the best
  throughput.

Compressed files
----------------

All three modes work transparently on ``.gz`` and ``.xz`` files. The
decompression cost is paid per-worker, so with the multiprocessing pattern
above, compressed logs scale roughly linearly with worker count.
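
When sizing the window for compressed logs, it can help to measure the
worst-case event length on a sample before picking ``max_span_lines``. The
following is a minimal sketch that uses only the standard library and is not
part of ``file_re``. The helper names ``open_text`` and
``longest_event_span`` are hypothetical, and the ``BEGIN TXN`` / ``END TXN``
markers are assumed to sit on lines of their own, as in the examples on this
page.

.. code-block:: python

    import gzip
    import lzma
    from pathlib import Path


    def open_text(path):
        """Open a plain, .gz, or .xz file as a text stream (hypothetical helper)."""
        path = Path(path)
        if path.suffix == ".gz":
            return gzip.open(path, "rt", encoding="utf-8", errors="replace")
        if path.suffix == ".xz":
            return lzma.open(path, "rt", encoding="utf-8", errors="replace")
        return open(path, "rt", encoding="utf-8", errors="replace")


    def longest_event_span(path, begin="BEGIN TXN", end="END TXN"):
        """Return the largest line count from a begin marker to its end marker.

        Assumes each marker occupies a whole line. If a second begin marker
        appears before the event is closed, counting restarts from it.
        Returns 0 when no complete event is found.
        """
        worst = 0
        start = None  # line number of the currently open event, if any
        with open_text(path) as fh:
            for lineno, line in enumerate(fh, start=1):
                text = line.rstrip("\n")
                if text == begin:
                    start = lineno
                elif text == end and start is not None:
                    worst = max(worst, lineno - start + 1)
                    start = None
        return worst

The value returned for a sample is only a lower bound on a safe
``max_span_lines``; leave some headroom if files outside the sample may
contain longer events.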