
GFF File Reading Performance Enhancements in polars-bio 0.15.0

We're excited to announce significant performance improvements to GFF file reading in polars-bio 0.15.0. This release introduces three major optimizations that dramatically improve both speed and memory efficiency when working with GFF files:

Key Enhancements

Projection Pushdown: Only the columns you need are read from disk, reducing I/O overhead and memory usage. This is particularly beneficial when working with wide GFF files that contain many optional attributes.

Predicate Pushdown: Row filtering is applied during the file reading process, eliminating the need to load irrelevant data into memory. This allows for lightning-fast queries on large GFF datasets.

Fully Streamed Parallel Reads: BGZF-compressed files can now be read in parallel with true streaming, enabling out-of-core processing of massive genomic datasets without memory constraints.
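
Both pushdowns kick in automatically when you query a lazy GFF scan. Below is a minimal sketch of the pattern, assuming a scan_gff entry point that returns a Polars LazyFrame (the exact function name and signature may differ; see the polars-bio API docs):

    import polars as pl
    import polars_bio as pb

    # Lazily scan a BGZF-compressed GFF file; nothing is read yet.
    # (Entry point assumed here; check the polars-bio API docs for the exact name.)
    lf = pb.scan_gff("annotations.gff3.bgz")

    result = (
        lf
        # Predicate pushdown: the filter runs while the file is read,
        # so non-matching rows never reach memory.
        .filter(pl.col("seqid") == "chrY")
        # Projection pushdown: only these four columns are decoded from disk.
        .select(["seqid", "start", "end", "type"])
        .collect()
    )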

Benchmark Methodology

To evaluate these improvements, we conducted comprehensive benchmarks comparing three popular data processing libraries:

  • Pandas: The traditional Python data analysis library
  • Polars: High-performance DataFrame library with lazy evaluation
  • polars-bio: Our specialized genomic data processing library built on Polars and Apache DataFusion

All benchmarks were performed on a large GFF file (~7.7 million records; both the file and its index are needed for parallel reading), with full scan and filtered query scenarios to demonstrate real-world performance gains.

For Pandas and Polars reading, we used the following methods (thanks to @urineri for the Polars code). Since Polars decompresses compressed CSV/TSV files completely in memory, as highlighted here, we also used polars_streaming_csv_decompression, a great plugin developed by @ghuls that enables streaming decompression in Polars.
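
For reference, the baselines looked roughly like the sketch below; the file name and exact reader options are illustrative, and the streaming-decompression plugin is omitted since its API is documented separately (the linked benchmark code is authoritative):

    import pandas as pd
    import polars as pl

    GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
                   "score", "strand", "phase", "attributes"]

    # Pandas: BGZF is valid gzip, so the gzip codec can read it,
    # but the whole file is materialized in memory.
    df_pd = pd.read_csv("annotations.gff3.bgz", sep="\t", comment="#",
                        names=GFF_COLUMNS, compression="gzip")

    # Vanilla Polars: compressed CSV/TSV input is decompressed fully in memory.
    df_pl = pl.read_csv("annotations.gff3.bgz", separator="\t", comment_prefix="#",
                        has_header=False, new_columns=GFF_COLUMNS)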

Test query used for filtered benchmarks (Polars and polars-bio):

    result = (
        lf.filter(
            (pl.col("seqid") == "chrY")
            & (pl.col("start") < 500000)
            & (pl.col("end") > 510000)
        )
        .select(["seqid", "start", "end", "type"])
        .collect()
    )

The above query is very selective and returns only two rows from the entire dataset.
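
As a sanity check with vanilla Polars, you can confirm that both pushdowns reach the scan by inspecting the optimized query plan (lf here is the LazyFrame from the reader sketch above):

    # The optimized plan should show the column projection and the row
    # filter attached to the scan node rather than as separate steps.
    plan = (
        lf.filter(
            (pl.col("seqid") == "chrY")
            & (pl.col("start") < 500000)
            & (pl.col("end") > 510000)
        )
        .select(["seqid", "start", "end", "type"])
        .explain()
    )
    print(plan)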

Results

Complete benchmark code and results are available in the polars-bio repository.

Single-threaded performance

[Figure: single-threaded performance comparison (general_performance.png)]

Key takeaways:

  • polars-bio delivers performance comparable to standard Polars for full scan operations, and both significantly outperform Pandas.
  • For filtered queries, predicate and projection pushdown bring further performance gains for both Polars and polars-bio, with polars-bio 2.5x faster than standard Polars.

Memory usage

[Figure: memory usage comparison (memory_comparison.png)]

Key takeaways:

  • Polars and polars-bio use significantly less memory than Pandas for all operations.
  • polars-bio and Polars with polars_streaming_csv_decompression can use more than 20x less memory than vanilla Polars and more than two orders of magnitude less memory than Pandas for operations involving filtering.

Thread scalability

[Figure: thread scalability (thread_scalability.png)]

Key takeaways:

  • polars-bio achieves near-linear scaling up to 8 threads for full scan operations, reaching 9.5x speedup at 16 threads compared to single-threaded performance.
  • Filtered operations show excellent parallelization, with polars-bio reaching an 11x speedup at 16 threads and significantly outperforming the other libraries. There is, however, non-negligible parallelism overhead at 1 thread (2.25 s in the single-threaded benchmark vs 4.2 s here).
  • polars-streaming shows diminishing returns at higher thread counts due to the overhead of spawning decompression program threads (in the default configuration, this is capped at 4), while polars-bio maintains consistent scaling benefits.

Summary

The benchmarks demonstrate that polars-bio 0.15.0 delivers significant performance improvements for GFF file processing. These optimizations, combined with near-linear thread scaling and fully streamed parallel reads, make polars-bio an ideal choice for high-performance genomic data analysis workflows.

If you haven't tried polars-bio yet, now is a great time to explore its capabilities for efficient genomic data processing with Python! Join our upcoming seminar on September 15, 2025, to learn more about polars-bio and its applications in genomics.

[Figure: polars-bio: High-Performance Python DataFrame Operations for Genomics]

Interval operations benchmark — update September 2025

Introduction

Benchmarking isn’t a one-and-done exercise—it’s a moving target. As tools evolve, new versions can shift performance profiles in meaningful ways, so keeping results current is just as important as the first round of measurements.

Recently, three libraries that have been gaining traction, pyranges1, GenomicRanges, and polars-bio, shipped major updates.

[Figure: GitHub star history of pyranges1, GenomicRanges, and polars-bio (star-history-202595.png)]

Each of these updates has the potential to meaningfully alter performance and memory characteristics for common genomic interval tasks.

In this post, we revisit our benchmarks with those releases in mind. We focus on three everyday operations:

  • overlap detection,
  • nearest feature queries, and
  • overlap counting.

For comparability, we use the same AIList dataset from our previous write-up, so you can see exactly how the new backends and data structures change the picture. Let’s dive in and see what’s faster, what’s leaner, and where the trade-offs now live.

Setup

Benchmark test cases

Dataset pairs    Size      # of overlaps (1-based)
1-2 & 2-1        Small     54,246
7-3 & 3-7        Medium    4,408,383
8-7 & 7-8        Large     307,184,634

Software versions

Library          Version
polars_bio       0.13.1
pyranges         0.1.14
genomicranges    0.7.2

Results

polars-bio interval data structures performance comparison

[Figure: interval data structure comparison across test cases (combined_multi_testcase.png)]

Key takeaways:

  • Superintervals seems to be the best default: across all three test cases it is consistently the fastest or tied for fastest, delivering 1.25–1.44x speedups over the polars-bio default (COITrees) and avoiding worst-case behavior (a sketch of selecting it per call follows below).
  • Lapper caveat: it performs well on 1-2 and 8-7 but collapses on 7-3 (≈25x slower than the default), so it is risky as a general-purpose algorithm.
  • Intervaltree/Arrayintervaltree: reliable but slower, trailing superintervals by 20–70% depending on the case.
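
If you want to experiment with these data structures yourself, here is a minimal sketch, assuming the data structure can be selected via an algorithm parameter on pb.overlap (the parameter name and accepted values are assumptions; see the polars-bio API reference):

    import polars as pl
    import polars_bio as pb

    a = pl.DataFrame({"contig": ["chr1"], "pos_start": [100], "pos_end": [200]})
    b = pl.DataFrame({"contig": ["chr1"], "pos_start": [150], "pos_end": [250]})

    # Hypothetical algorithm selection; "superintervals" stands in for
    # whatever identifier the library actually accepts.
    result = pb.overlap(a, b, algorithm="superintervals")
    print(result)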

All operations comparison

[Figure: wall-time comparison across all operations (all_operations_walltime_comparison.png)]

[Figure: speedup comparison across all operations (bench-20250-all_operations_speedup_comparison.png)]

Key takeaways:

  • Overlap: GenomicRanges wins on small inputs (1-2, 2-1) by ~2.1–2.3x, but polars-bio takes over from medium size onward and dominates on large inputs (7-8, 8-7), where PyRanges falls far behind. Interestingly, swapping the inputs (7-8 vs 8-7) can significantly affect GenomicRanges' performance.
  • Nearest: polars-bio leads decisively at every size; its speedups over the others grow with input size (orders of magnitude on large datasets).
  • Count overlaps: GenomicRanges edges out polars-bio on the smallest inputs, while polars-bio is faster on medium and substantially faster on large inputs. (All three operations are sketched below.)
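
The three benchmarked operations correspond to polars-bio's range API. A compact sketch, assuming the default interval column names contig/pos_start/pos_end (check the docs for your version):

    import polars as pl
    import polars_bio as pb

    a = pl.DataFrame({"contig": ["chr1", "chr1"],
                      "pos_start": [100, 400],
                      "pos_end": [200, 500]})
    b = pl.DataFrame({"contig": ["chr1"],
                      "pos_start": [150],
                      "pos_end": [250]})

    overlaps = pb.overlap(a, b)          # pairs of overlapping intervals
    nearest  = pb.nearest(a, b)          # closest b-interval for each interval in a
    counts   = pb.count_overlaps(a, b)   # number of b-intervals hitting each a-interval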

All operations parallel execution

[Figure: parallel execution wall times, GenomicRanges vs polars-bio (benchmark_comparison_genomicranges_vs_polars_bio.png)]

[Figure: parallel execution speedups, GenomicRanges vs polars-bio (benchmark_speedup_comparison_genomicranges_vs_polars_bio.png)]

Key takeaways:

  • Thread scaling: both libraries (GenomicRanges and polars-bio) benefit from additional threads, but the absolute gap favors polars‑bio for medium/large datasets across overlap, nearest, and count overlaps.
  • Small overlaps: GenomicRanges remains >2x faster at 1–8 threads; on medium/large pairs its relative speed drops below 1x.
  • Nearest: polars‑bio stays on the 1x reference line; GenomicRanges is typically 10–100x slower (log scale) even with more threads.
  • Count overlaps: small inputs slightly favor GenomicRanges; for larger inputs polars‑bio maintains 2–10x advantage with stable scaling.

End-to-end data processing

Here we compare end-to-end performance including data loading, overlap operation, and saving results to CSV.
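
A sketch of such an end-to-end run with polars-bio, streaming the result straight to CSV; the streaming flag is an assumption on our part, so consult the polars-bio docs for the exact out-of-core option:

    import polars_bio as pb

    # Scan two interval files, compute overlaps, and write the result to CSV.
    # streaming=True is assumed here to request out-of-core execution.
    result = pb.overlap("a.bed", "b.bed", streaming=True)

    # If a Polars LazyFrame is returned, sink_csv streams rows to disk
    # without materializing the full result in memory.
    result.sink_csv("overlaps.csv")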

Info

  1. POLARS_MAX_THREADS=1 was set to ensure a fair comparison with single-threaded PyRanges (see the sketch below).
  2. Since GenomicRanges supports Polars DataFrames as input and output, we used them instead of Pandas, again to ensure a fair comparison with polars-bio.
  3. Since the GenomicRanges find_overlaps method returns a hits-only table (indices of genomic intervals instead of genomic coordinates), we also benchmarked an extended version with an additional lookup of the intervals (full rows, code) for a fair comparison.
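
Pinning the Polars thread pool, as in point 1 above, is done with the documented POLARS_MAX_THREADS environment variable, which must be set before Polars is imported:

    import os

    # Polars sizes its thread pool at import time, so set this first.
    os.environ["POLARS_MAX_THREADS"] = "1"

    import polars as pl

    print(pl.thread_pool_size())  # should report 1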

[Figure: end-to-end benchmark, wall time and peak memory (combined_benchmark_visualization.png)]

Key takeaways:

  • Wall time: GenomicRanges (hits-only) is the fastest end-to-end here (~1.16x vs polars_bio) because it avoids fully materializing genomic intervals, unlike PyRanges and polars-bio, which return pairs of genomic interval coordinates for each overlap. PyRanges is far slower, and GenomicRanges (full rows, i.e. with output comparable to PyRanges and polars-bio) is much slower still.
  • Memory: polars-bio (streaming) minimizes peak RAM (~0.7 GB) while keeping speed comparable to default polars-bio. GenomicRanges (full rows) peaks at ~40 GB; hits-only sits in the middle (~8.2 GB), as it returns only a DataFrame of index pairs rather than full genomic coordinates.

Summary

For small and medium datasets, all tools perform well; at large scale, polars-bio excels with better scalability and memory efficiency, achieving an ultra‑low footprint in streaming mode.