Interval operations benchmark — update September 2025

Introduction

Benchmarking isn’t a one-and-done exercise—it’s a moving target. As tools evolve, new versions can shift performance profiles in meaningful ways, so keeping results current is just as important as the first round of measurements.

Recently, three novel libraries that have started to gain traction (pyranges1, GenomicRanges, and polars-bio) shipped major updates:

pyranges1 adopted a new Rust backend (ruranges),
GenomicRanges switched its interval core to a Nested Containment List (NCLS) and added multithreaded execution,
polars-bio migrated to the new Polars streaming engine and added support for new interval data structures. As of version 0.12.0 it supports:

Each of these changes has the potential to meaningfully alter performance and memory characteristics for common genomic interval tasks.

In this post, we revisit our benchmarks with those releases in mind. We focus on three everyday operations:

overlap detection,
nearest feature queries,
overlap counting.
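
To make the three operations concrete, here is a minimal brute-force sketch of what each one computes. It is a reference for the semantics only, not how any of the benchmarked libraries implement them. The closed-interval convention, the distance definition, and the names `genes` and `peaks` are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), both ends inclusive (assumed convention)


def overlaps(a: Interval, b: Interval) -> bool:
    """Closed intervals overlap when each one starts no later than the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_pairs(left: List[Interval], right: List[Interval]) -> List[Tuple[int, int]]:
    """Overlap detection: every (left index, right index) pair that overlaps."""
    return [
        (i, j)
        for i, a in enumerate(left)
        for j, b in enumerate(right)
        if overlaps(a, b)
    ]


def count_overlaps(left: List[Interval], right: List[Interval]) -> List[int]:
    """Overlap counting: for each left interval, how many right intervals it hits."""
    return [sum(overlaps(a, b) for b in right) for a in left]


def distance(a: Interval, b: Interval) -> int:
    """Gap between two closed intervals; 0 when they overlap (assumed definition)."""
    if overlaps(a, b):
        return 0
    return b[0] - a[1] if b[0] > a[1] else a[0] - b[1]


def nearest(left: List[Interval], right: List[Interval]) -> List[int]:
    """Nearest feature: index of the closest right interval for each left interval.

    Ties are resolved in favor of the lowest index.
    """
    return [min(range(len(right)), key=lambda j: distance(a, right[j])) for a in left]


# Hypothetical toy inputs
genes = [(10, 20), (50, 60)]
peaks = [(15, 25), (18, 30), (100, 110)]

pairs = overlap_pairs(genes, peaks)
counts = count_overlaps(genes, peaks)
closest = nearest(genes, peaks)
```

All three functions here are quadratic in the input size, which is exactly the cost that the interval data structures compared below are designed to avoid.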

For comparability, we use the same AIList dataset from our previous write-up, so you can see exactly how the new backends and data structures change the picture. Let’s dive in and see what’s faster, what’s leaner, and where the trade-offs now live.

Setup

Benchmark test cases

Dataset pairs	Size	# of overlaps (1-based)
1-2 & 2-1	Small	54,246
7-3 & 3-7	Medium	4,408,383
8-7 & 7-8	Large	307,184,634
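
The "(1-based)" qualifier in the table header matters because the overlap count depends on the coordinate convention. The sketch below assumes that "1-based" means closed intervals and contrasts it with the 0-based, half-open convention; the source does not spell this out, so treat that reading as an assumption.

```python
def overlaps_closed(a, b):
    """1-based, closed intervals [start, end]: a shared endpoint counts as an overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlaps_half_open(a, b):
    """0-based, half-open intervals [start, end): a shared boundary is not an overlap."""
    return a[0] < b[1] and b[0] < a[1]


# Two hypothetical intervals that only touch at position 200
a, b = (100, 200), (200, 300)

closed_hit = overlaps_closed(a, b)        # counted under the closed convention
half_open_hit = overlaps_half_open(a, b)  # not counted under the half-open convention
```

When comparing libraries, the same convention has to be used for all of them, otherwise the overlap totals will not match.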

Software versions

Library	Version
polars_bio	0.13.1
pyranges	0.1.14
genomicranges	0.7.2

Results

polars-bio interval data structures performance comparison

Key takeaways:

Superintervals seems to be the best default. Across all three test cases, it is consistently the fastest or tied for fastest, delivering 1.25–1.44x speedups over the polars-bio default (COITrees) and avoiding worst‑case behavior.
Lapper caveat: performs well on 1‑2 and 8‑7, but collapses on 7‑3 (≈25x slower than default), so it’s risky as a general‑purpose algorithm.
Intervaltree/Arrayintervaltree: reliable but slower. They trail superintervals by 20–70% depending on the case.
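
For readers who want to reproduce the "Nx faster" and "Nx slower" figures from their own runs, the ratios can be computed as sketched below. The timings are hypothetical placeholders chosen to give round numbers, not measurements from this benchmark.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Ratio above 1 means the candidate is faster than the baseline."""
    return baseline_seconds / candidate_seconds


# Hypothetical wall times in seconds (not measured results)
timings = {"default": 2.0, "candidate_a": 1.6, "candidate_b": 50.0}

relative = {
    name: speedup(timings["default"], seconds) for name, seconds in timings.items()
}
# candidate_a: about 1.25x faster than the default
# candidate_b: 0.04x, i.e. 25x slower than the default
slowdown_b = 1.0 / relative["candidate_b"]
```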

All operations comparison

Key takeaways:

Overlap: GenomicRanges wins on small inputs (1‑2, 2‑1) by ~2.1–2.3x, but polars‑bio takes over from medium size onward and dominates on large (7‑8, 8‑7), where PyRanges falls far behind. An interesting case is 7‑8 vs 8‑7: swapping the inputs can significantly affect GenomicRanges performance.
Nearest: polars‑bio leads decisively at every size; speedups over the others grow with input size (orders of magnitude on large datasets).
Count overlaps: GenomicRanges edges out polars‑bio on the smallest inputs, while polars‑bio is faster on medium and substantially faster on large inputs.
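
One way to see why input order can matter is that interval algorithms usually treat the two inputs asymmetrically: one side is indexed, the other is used to query. Below is a minimal sketch of overlap counting with sorted endpoints and binary search. It is a generic textbook technique, not the method used by any of the benchmarked libraries, and it is not offered as the explanation for the 7‑8 vs 8‑7 difference. Closed intervals are assumed.

```python
from bisect import bisect_left, bisect_right


def build_index(intervals):
    """Sort starts and ends once; this is the 'indexed' side, O(n log n)."""
    starts = sorted(start for start, _ in intervals)
    ends = sorted(end for _, end in intervals)
    return starts, ends


def count_overlaps(index, queries):
    """For each query, count indexed intervals that overlap it, O(log n) per query."""
    starts, ends = index
    counts = []
    for q_start, q_end in queries:
        started = bisect_right(starts, q_end)  # intervals with start <= q_end
        finished = bisect_left(ends, q_start)  # intervals with end < q_start
        counts.append(started - finished)      # started but not yet finished
    return counts


# Hypothetical toy inputs
indexed = [(1, 5), (3, 8), (10, 12)]
queries = [(4, 4), (9, 9), (5, 10)]

counts = count_overlaps(build_index(indexed), queries)
```

With n indexed intervals and m queries the cost is roughly O(n log n + m log n), so which input plays which role changes the constant factors even though the total number of overlapping pairs stays the same.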

All operations parallel execution

Key takeaways:

Thread scaling: both libraries (GenomicRanges and polars-bio) benefit from additional threads, but the absolute gap favors polars‑bio for medium/large datasets across overlap, nearest, and count overlaps.
Small overlaps: GenomicRanges remains >2x faster at 1–8 threads; on medium/large pairs its relative speed drops below 1x.
Nearest: polars‑bio stays on the 1x reference line; GenomicRanges is typically 10–100x slower (log scale) even with more threads.
Count overlaps: small inputs slightly favor GenomicRanges; for larger inputs polars‑bio maintains 2–10x advantage with stable scaling.

End-to-end data processing

Here we compare end-to-end performance including data loading, overlap operation, and saving results to CSV.
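
As a rough illustration of what "end-to-end" covers, the sketch below times a complete load, overlap, and save pipeline using only the Python standard library. The brute-force `overlap_rows` function is a stand-in for a library call, the file names and toy data are hypothetical, and `tracemalloc` only sees Python-level allocations, so a real benchmark of Rust- or C-backed libraries would need a process-level memory measurement instead.

```python
import csv
import os
import tempfile
import time
import tracemalloc


def read_intervals(path):
    """Load (chrom, start, end) rows from a headerless CSV file."""
    with open(path, newline="") as handle:
        return [(row[0], int(row[1]), int(row[2])) for row in csv.reader(handle)]


def overlap_rows(left, right):
    """Brute-force stand-in for a library overlap call (closed intervals)."""
    for chrom_a, start_a, end_a in left:
        for chrom_b, start_b, end_b in right:
            if chrom_a == chrom_b and start_a <= end_b and start_b <= end_a:
                yield chrom_a, start_a, end_a, start_b, end_b


def run_pipeline(left_path, right_path, out_path):
    """Load both inputs, compute overlaps, write CSV; return (seconds, peak bytes, rows)."""
    tracemalloc.start()
    started = time.perf_counter()
    left = read_intervals(left_path)
    right = read_intervals(right_path)
    rows = 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for row in overlap_rows(left, right):
            writer.writerow(row)
            rows += 1
    elapsed = time.perf_counter() - started
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak, rows


def write_csv(path, rows):
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


# Hypothetical toy inputs written to a temporary directory
workdir = tempfile.mkdtemp()
left_path = os.path.join(workdir, "left.csv")
right_path = os.path.join(workdir, "right.csv")
out_path = os.path.join(workdir, "overlaps.csv")
write_csv(left_path, [("chr1", 10, 20), ("chr1", 50, 60), ("chr2", 5, 9)])
write_csv(right_path, [("chr1", 15, 25), ("chr2", 1, 5), ("chr2", 20, 30)])

elapsed, peak, n_rows = run_pipeline(left_path, right_path, out_path)
```

Timing the whole pipeline rather than the overlap call alone is what exposes the cost of materializing and serializing the result.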

Info

POLARS_MAX_THREADS=1 was set to ensure fair comparison with single-threaded PyRanges.
Since GenomicRanges supports Polars DataFrames as input and output, we used them instead of Pandas to again ensure fair comparison with polars-bio.
The GenomicRanges find_overlaps method returns a hits-only table (indices of genomic intervals instead of genomic coordinates), so we also benchmarked an extended version with an additional lookup of intervals (full rows, code) for a fair comparison.
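
The difference between a hits-only result and full rows can be sketched in plain Python. This is a conceptual illustration only: the tuples, the `hits` list, and the `materialize` helper are hypothetical and do not reflect the GenomicRanges API or the lookup code used in the benchmark.

```python
# Hypothetical inputs: (chrom, start, end)
left = [("chr1", 10, 20), ("chr1", 50, 60)]
right = [("chr1", 15, 25), ("chr1", 18, 30), ("chr1", 55, 58)]

# Hits-only result: one (left index, right index) pair per overlap
hits = [(0, 0), (0, 1), (1, 2)]


def materialize(hits, left, right):
    """Look up both intervals for every hit and join them into one full row."""
    return [left[i] + right[j] for i, j in hits]


full_rows = materialize(hits, left, right)

# Each hit carries 2 values; each full row carries 6
values_per_hit = len(hits[0])
values_per_row = len(full_rows[0])
```

The extra lookup multiplies the number of values held per overlap, which is why the two output shapes have to be benchmarked separately.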

Key takeaways:

Wall time: GenomicRanges (hits‑only) is the fastest end‑to‑end here (~1.16x vs polars-bio) because it avoids fully materializing genomic intervals, whereas PyRanges and polars-bio return a pair of genomic interval coordinates for each overlap. PyRanges is far slower. GenomicRanges (full rows), whose output is comparable with that of PyRanges and polars-bio, is much slower.
Memory: polars-bio (streaming) minimizes peak RAM (~0.7 GB) while keeping speed comparable to non-streaming polars-bio. GenomicRanges (full rows) peaks at ~40 GB; hits‑only sits in the middle (~8.2 GB) because it returns only a DataFrame of index pairs, not full genomic coordinates.

Summary

For small and medium datasets, all tools perform well; at large scale, polars-bio excels with better scalability and memory efficiency, achieving an ultra‑low footprint in streaming mode.