Blog

Tuesday, May 12, 2026
in releases, file formats, performance
7 min read

polars-bio 0.31.0: VCF Zarr Support for Array-Native Variant Analytics

polars-bio 0.31.0 adds first-class support for VCF Zarr, giving Python users a fast, lazy, DataFusion-backed way to query VCF-derived Zarr stores with Polars.

Friday, April 24, 2026
in performance, benchmarks, pandas
5 min read

Benchmarking DataFrame Paths in polars-bio 0.29.0

polars-bio 0.29.0 adds support for Pandas >= 3.0.0. Since pandas 3.0 made PyArrow-backed data even more central, with the new default string dtype using pyarrow under the hood when available, we wanted to measure what that means for interval workloads in practice.

So instead of comparing different interval libraries, this benchmark compares different input and execution paths through the same polars-bio range engine:

direct Parquet scan through Apache DataFusion
Pandas DataFrame
Pandas with Arrow-backed dtypes
Polars eager DataFrame
Polars lazy LazyFrame

The question is simple: how much overhead do you pay once data is materialized into a Python DataFrame, and how much of that gap can Arrow-backed Pandas close?

Saturday, March 7, 2026
in releases
2 min read

polars-bio 0.26.0: GTF Support and Smart Tag Type Inference

polars-bio 0.26.0 brings first-class GTF format support, automatic SAM tag type inference for custom/nanopore tags, and critical bug fixes for multi-partition writes and VCF contig metadata.

Friday, February 20, 2026
in performance, benchmarks
4 min read

Interval operations benchmark — update February 2026

Introduction

Back in September 2025 we benchmarked three libraries across three operations. A lot has changed since then. In December 2025, pyranges1 published a preprint describing its Rust-powered backend (ruranges) and an expanded set of interval operations. On the polars-bio side, version 0.24.0 ships a fully rewritten range-operations engine built on upstream DataFusion UDTF providers (OverlapProvider, NearestProvider, and the new coverage/cluster/complement/merge/subtract providers from datafusion-bio-function-ranges), replacing the earlier sequila-native backend.

Saturday, February 14, 2026
in performance, benchmarks, file formats
5 min read

Benchmarking Genomic Format Readers in Python with Polars

Genomic analyses in Python typically start with reading BAM, VCF, or FASTQ files into memory. The choice of library for this step can have a dramatic impact on both wall-clock time and memory consumption — especially as datasets grow to tens or hundreds of millions of records.

pysam has long been the go-to Python library for working with these formats. It provides comprehensive bindings to htslib and is battle-tested across thousands of projects. However, several newer libraries have emerged that leverage Apache Arrow columnar format and Rust-based parsers to offer potentially better performance.

In this post, we benchmark four Python libraries head-to-head on real-world genomic data to find out which offers the best combination of speed and memory efficiency for reading BAM, VCF, and FASTQ files.

Saturday, February 14, 2026
in releases, performance
3 min read

polars-bio 0.23.0: Faster Parsing and Python 3.14 Support

polars-bio 0.23.0 is here with significant parsing performance improvements across VCF, BAM, and FASTQ formats, plus first-day support for Python 3.14. This release bumps the underlying datafusion-bio-formats engine to 0.5.0, delivering up to 3.6x faster VCF parsing with no API changes required.

Monday, September 8, 2025
in performance, benchmarks, file formats
4 min read

GFF File Reading Performance Enhancements in polars-bio 0.15.0

We're excited to announce significant performance improvements to GFF file reading in polars-bio 0.15.0. This release introduces two major optimizations that dramatically improve both speed and memory efficiency when working with GFF files:

Key Enhancements

Projection Pushdown: Only the columns you need are read from disk, reducing I/O overhead and memory usage. This is particularly beneficial when working with wide GFF files that contain many optional attributes.

Predicate Pushdown: Row filtering is applied during the file reading process, eliminating the need to load irrelevant data into memory. This allows for lightning-fast queries on large GFF datasets.

Fully Streamed Parallel Reads: BGZF-compressed files can now be read in parallel with true streaming, enabling out-of-core processing of massive genomic datasets without memory constraints.
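To make the first two ideas concrete, here is a minimal pure-Python sketch. It is not how polars-bio implements pushdown: the `scan_gff` helper, its parameters, and the sample records are hypothetical and exist only for illustration. The point is that the filter and the column selection run while each line is read, so rejected rows and unused columns are never accumulated in memory.

```python
import io

# Standard GFF3 column names (nine tab-separated fields).
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]


def scan_gff(handle, columns, predicate):
    """Yield only the requested columns of rows that pass `predicate`.

    Hypothetical helper for illustration; not part of any library.
    """
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        fields = line.rstrip("\n").split("\t")
        row = dict(zip(GFF_COLUMNS, fields))
        row["start"], row["end"] = int(row["start"]), int(row["end"])
        if predicate(row):                       # filter while reading
            yield {c: row[c] for c in columns}   # keep requested columns only


# Made-up sample records, for demonstration only.
sample = io.StringIO(
    "##gff-version 3\n"
    "chrY\tdemo\tgene\t100\t900\t.\t+\t.\tID=g1\n"
    "chrY\tdemo\texon\t100\t300\t.\t+\t.\tID=e1\n"
    "chr1\tdemo\tgene\t100\t900\t.\t-\t.\tID=g2\n"
)

rows = list(scan_gff(
    sample,
    columns=["seqid", "start", "end", "type"],
    predicate=lambda r: r["seqid"] == "chrY" and r["start"] < 200 and r["end"] > 500,
))
print(rows)
```

Only the first record survives the filter, and it carries four fields instead of nine. A columnar engine can go further and skip decoding unused columns entirely, which this line-based sketch does not attempt.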

Benchmark Methodology

To evaluate these improvements, we conducted comprehensive benchmarks comparing three popular data processing libraries:

Pandas: The traditional Python data analysis library
Polars: High-performance DataFrame library with lazy evaluation
polars-bio: Our specialized genomic data processing library built on Polars and Apache DataFusion

All benchmarks were performed on a large GFF file (~7.7 million records, file and index needed for parallel reading) with both full scan and filtered query scenarios to demonstrate real-world performance gains.

For pandas and polars reading, we used the following methods (thanks to @urineri for the Polars code). Since Polars decompresses compressed CSV/TSV files completely in memory as highlighted here, we also used polars_streaming_csv_decompression, a great plugin developed by @ghuls to enable streaming decompression in Polars.

Test query used for filtered benchmarks (Polars and polars-bio):

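    import polars as pl

    # `lf` is the LazyFrame produced by the reader under test.
    result = (
        lf.filter(
            (pl.col("seqid") == "chrY")
            & (pl.col("start") < 500000)
            & (pl.col("end") > 510000)
        )
        .select(["seqid", "start", "end", "type"])  # projection: four columns only
        .collect()
    )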

The above query is very selective and returns only two rows from the entire dataset.

Results

Complete benchmark code and results are available in the polars-bio repository.

Single-threaded performance

Key takeaways:

polars-bio delivers comparable performance to standard Polars for full scan operations and both significantly outperform Pandas.
In the case of filtered queries, we can see further performance improvements with Polars and polars-bio thanks to predicate and projection pushdown optimizations. polars-bio is 2.5x faster than standard Polars.

Memory usage

Key takeaways:

Polars and polars-bio use significantly less memory than Pandas for all operations.
polars-bio and Polars with polars_streaming_csv_decompression can use more than 20x less memory than vanilla Polars and more than two orders of magnitude less memory than Pandas for operations involving filtering.

Thread scalability

Key takeaways:

polars-bio achieves near-linear scaling up to 8 threads for full scan operations, reaching 9.5x speedup at 16 threads compared to single-threaded performance.
Filtered operations show excellent parallelization with polars-bio reaching 11x speedup at 16 threads, significantly outperforming other libraries. There is, however, non-negligible overhead due to parallelism at 1 thread (2.25s vs 4.2s, compared to the single-threaded benchmark).
polars-streaming shows diminishing returns at higher thread counts due to the overhead of spawning decompression program threads (in the default configuration, this is capped at 4), while polars-bio maintains consistent scaling benefits.
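As a quick sanity check on what these speedups mean, parallel efficiency is simply the speedup divided by the thread count. The small helper below is our own illustrative function, applied to the two 16-thread figures quoted above (each relative to its own baseline).

```python
def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return speedup / threads


# Speedups at 16 threads quoted in the takeaways above.
full_scan = parallel_efficiency(9.5, 16)
filtered = parallel_efficiency(11.0, 16)

print(f"full scan: {full_scan:.0%} of ideal")
print(f"filtered:  {filtered:.0%} of ideal")
```

That works out to roughly 59% of ideal scaling for full scans and 69% for filtered queries at 16 threads, which is consistent with scaling that is near-linear up to 8 threads and flattens beyond that.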

Summary

The benchmarks demonstrate that polars-bio 0.15.0 delivers significant performance improvements for GFF file processing. These optimizations, combined with near-linear thread scaling and fully streamed parallel reads, make polars-bio an ideal choice for high-performance genomic data analysis workflows.

If you haven't tried polars-bio yet, now is a great time to explore its capabilities for efficient genomic data processing with Python! Join our upcoming seminar on September 15, 2025, to learn more about polars-bio and its applications in genomics.

Friday, September 5, 2025
in performance, benchmarks
5 min read

Interval operations benchmark — update September 2025

Introduction

Benchmarking isn’t a one-and-done exercise—it’s a moving target. As tools evolve, new versions can shift performance profiles in meaningful ways, so keeping results current is just as important as the first round of measurements.

Recently, three novel libraries that have started to gain traction (pyranges1, GenomicRanges, and polars-bio) shipped major updates:

pyranges1 adopted a new Rust backend (ruranges),
GenomicRanges switched its interval core to a Nested Containment List (NCLS) and added multithreaded execution,
polars-bio migrated to the new Polars streaming engine and, as of version 0.12.0, added support for new interval data structures (compared in the results below).

Each of these changes has the potential to meaningfully alter performance and memory characteristics for common genomic interval tasks.

In this post, we revisit our benchmarks with those releases in mind. We focus on three everyday operations:

overlap detection,
nearest feature queries,
overlap counting.

For comparability, we use the same AIList dataset from our previous write-up, so you can see exactly how the new backends and data structures change the picture. Let’s dive in and see what’s faster, what’s leaner, and where the trade-offs now live.
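For readers new to these operations, the brute-force sketch below pins down what each one computes on a toy input. It is a reference definition only, assuming closed intervals on a single contig; none of the benchmarked libraries work this way, and the function names are our own.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_overlaps(left, right):
    """All (i, j) index pairs where left[i] overlaps right[j]."""
    return [(i, j) for i, a in enumerate(left)
            for j, b in enumerate(right) if overlaps(a, b)]


def count_overlaps(left, right):
    """For each interval in left, the number of overlapping intervals in right."""
    return [sum(overlaps(a, b) for b in right) for a in left]


def distance(a, b):
    """Gap between two closed intervals; 0 if they overlap."""
    if overlaps(a, b):
        return 0
    return b[0] - a[1] if b[0] > a[1] else a[0] - b[1]


def nearest(left, right):
    """For each interval in left, the index of the closest interval in right.

    Ties are broken in favor of the lowest index.
    """
    return [min(range(len(right)), key=lambda j: distance(a, right[j]))
            for a in left]


left = [(1, 10), (20, 30), (60, 70)]
right = [(5, 8), (9, 25), (40, 50)]

pairs = find_overlaps(left, right)
counts = count_overlaps(left, right)
closest = nearest(left, right)
print(pairs, counts, closest)
```

The quadratic loops are exactly what the indexed data structures discussed in this post are designed to avoid.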

Setup

Benchmark test cases

Dataset pairs	Size	# of overlaps (1-based)
1-2 & 2-1	Small	54,246
7-3 & 3-7	Medium	4,408,383
8-7 & 7-8	Large	307,184,634
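The "(1-based)" qualifier matters because overlap counts depend on the coordinate convention. The sketch below assumes that 1-based means closed intervals (both endpoints included) and contrasts it with the 0-based, half-open convention; the two helper functions are illustrative and not taken from any of the benchmarked libraries.

```python
def overlaps_closed(a, b):
    """1-based, closed intervals: both endpoints are included."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlaps_half_open(a, b):
    """0-based, half-open intervals: the end coordinate is excluded."""
    return a[0] < b[1] and b[0] < a[1]


# The same pair of coordinates under the two conventions.
a, b = (100, 200), (200, 300)
closed_hit = overlaps_closed(a, b)        # both intervals contain position 200
half_open_hit = overlaps_half_open(a, b)  # `a` stops just before 200
print(closed_hit, half_open_hit)
```

Intervals that merely touch at an endpoint count as overlapping under the closed convention but not under the half-open one, so totals like those in the table are only comparable when every tool is configured the same way.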

Software versions

Library	Version
polars_bio	0.13.1
pyranges	0.1.14
genomicranges	0.7.2

Results

polars-bio interval data structures performance comparison

Key takeaways:

Superintervals seems to be the best default. Across all three test cases, it is consistently the fastest or tied for fastest, delivering 1.25–1.44x speedups over the polars-bio default (COITrees) and avoiding worst‑case behavior.
Lapper caveat: performs well on 1‑2 and 8‑7, but collapses on 7‑3 (≈25x slower than default), so it’s risky as a general‑purpose algorithm.
Intervaltree/Arrayintervaltree: reliable but slower. They trail superintervals by 20–70% depending on the case.
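Why can one structure be fast on some dataset pairs and slow on others? The toy index below is a generic sorted-array sketch that we wrote for illustration; it makes no claim about the internals of any library named above. It shows the general effect: how many candidates a query has to scan can depend heavily on the shape of the data, here on a single very long interval.

```python
import bisect


class SortedIntervalIndex:
    """Toy index: intervals sorted by start plus the longest interval length."""

    def __init__(self, intervals):
        self.intervals = sorted(intervals)
        self.starts = [s for s, _ in self.intervals]
        self.max_len = max((e - s for s, e in self.intervals), default=0)

    def query(self, start, end):
        """Return (hits, scanned) for the closed query interval [start, end]."""
        # No interval starting before `start - max_len` can reach the query.
        lo = bisect.bisect_left(self.starts, start - self.max_len)
        # No interval starting after `end` can overlap the query.
        hi = bisect.bisect_right(self.starts, end)
        candidates = self.intervals[lo:hi]
        hits = [(s, e) for s, e in candidates if e >= start]
        return hits, len(candidates)


# 1000 short, evenly spaced intervals (synthetic data).
uniform = [(i * 10, i * 10 + 5) for i in range(1000)]
hits_u, scanned_u = SortedIntervalIndex(uniform).query(5000, 5005)

# The same data plus one interval spanning everything.
skewed = uniform + [(0, 10000)]
hits_s, scanned_s = SortedIntervalIndex(skewed).query(5000, 5005)

print(len(hits_u), scanned_u)
print(len(hits_s), scanned_s)
```

On the uniform data the query scans a single candidate. Adding one long interval widens the search window so that 502 candidates are scanned to find 2 hits.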

All operations comparison

Key takeaways:

Overlap: GenomicRanges wins on small inputs (1‑2, 2‑1) by ~2.1–2.3x, but polars‑bio takes over from medium size onward and dominates on large (7‑8, 8‑7), where PyRanges falls far behind. The 7‑8 vs 8‑7 pair is an interesting case: swapping the inputs can significantly affect the performance of GenomicRanges.
Nearest: polars‑bio leads decisively at every size; speedups over the others grow with input size (orders of magnitude on large datasets).
Count overlaps: GenomicRanges edges out polars‑bio on the smallest inputs, while polars‑bio is faster on medium and substantially faster on large inputs.

All operations parallel execution

[Figure: speedup comparison, GenomicRanges vs polars-bio]

Key takeaways:

Thread scaling: both libraries (GenomicRanges and polars-bio) benefit from additional threads, but the absolute gap favors polars‑bio for medium/large datasets across overlap, nearest, and count overlaps.
Small overlaps: GenomicRanges remains >2x faster at 1–8 threads; on medium/large pairs its relative speed drops below 1x.
Nearest: polars‑bio stays on the 1x reference line; GenomicRanges is typically 10–100x slower (log scale) even with more threads.
Count overlaps: small inputs slightly favor GenomicRanges; for larger inputs polars‑bio maintains 2–10x advantage with stable scaling.

End-to-end data processing

Here we compare end-to-end performance including data loading, overlap operation, and saving results to CSV.

Info

POLARS_MAX_THREADS=1 was set to ensure fair comparison with single-threaded PyRanges.
Since GenomicRanges supports Polars DataFrames as input and output, we used them instead of Pandas to again ensure fair comparison with polars-bio.
The GenomicRanges find_overlaps method returns a hits-only table (indices of genomic intervals instead of genomic coordinates), so for a fair comparison we also benchmarked an extended version that additionally looks up the intervals (full rows, code).
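The difference between the two output shapes can be sketched with plain Python tuples. This is not the GenomicRanges API: `hits_only` and `full_rows` are hypothetical helpers and the intervals are made up.

```python
left = [("chr1", 1, 10), ("chr1", 20, 30)]
right = [("chr1", 5, 8), ("chr1", 9, 25), ("chr1", 40, 50)]


def hits_only(left, right):
    """Index pairs (i, j) of overlapping intervals on the same contig."""
    return [(i, j)
            for i, (c1, s1, e1) in enumerate(left)
            for j, (c2, s2, e2) in enumerate(right)
            if c1 == c2 and s1 <= e2 and s2 <= e1]


def full_rows(hits, left, right):
    """Look up coordinates for every hit, giving one wide row per overlap."""
    return [left[i] + right[j] for i, j in hits]


hits = hits_only(left, right)
rows = full_rows(hits, left, right)
print(hits)
print(rows)
```

Each hit is two integers, while each full row repeats the contig name and coordinates of both intervals. With hundreds of millions of overlaps, that extra lookup and the wider output account for much of the additional time and memory.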

Key takeaways:

Wall time: GenomicRanges (hits‑only) is the fastest end‑to‑end here (~1.16x vs polars-bio) because it avoids full materialization of genomic intervals, whereas PyRanges and polars-bio return pairs of genomic interval coordinates for each overlap. PyRanges is far slower, and GenomicRanges (full rows), whose output is comparable with that of PyRanges and polars-bio, is much slower.
Memory: polars-bio (streaming) minimizes peak RAM (~0.7 GB) while keeping speed comparable to non-streaming polars-bio. GenomicRanges (full rows) peaks at ~40 GB; hits‑only sits in the middle (~8.2 GB) because it returns only a DataFrame with pairs of indices, not full genomic coordinates.

Summary

For small and medium datasets, all tools perform well; at large scale, polars-bio excels with better scalability and memory efficiency, achieving an ultra‑low footprint in streaming mode.