Next-gen Python DataFrame operations for genomics!
polars-bio is a blazing-fast Python DataFrame library for genomics 🧬, built on top of Apache DataFusion, Apache Arrow,
and polars.
It is designed to be easy to use, fast, and memory efficient, with a focus on genomics data.
Key Features
- optimized for performance and memory efficiency on large-scale genomics datasets, both when reading input data and when performing operations
- popular genomics operations with a DataFrame API (both pandas and polars)
- SQL-powered bioinformatic data querying or manipulation/pre-processing
- native parallel engine powered by Apache DataFusion and sequila-native
- out-of-core/streaming processing (for data too large to fit into a computer's main memory) with Apache DataFusion and polars
- federated and streamed reading of data from cloud storage (e.g. S3, GCS) with Apache OpenDAL, which enables processing large-scale genomics data without materializing it in memory
- zero-copy data exchange with Apache Arrow
- bioinformatics file formats with noodles and exon
- fast overlap operations with COITrees: Cache Oblivious Interval Trees
- pre-built wheel packages for Linux, Windows and macOS (arm64 and x86_64) available on PyPI
See quick start for the installation options.
Citing
If you use polars-bio in your work, please cite:
@article {Wiewiorka2025.03.21.644629,
author = {Wiewiorka, Marek and Khamutou, Pavel and Zbysinski, Marek and Gambin, Tomasz},
title = {polars-bio - fast, scalable and out-of-core operations on large genomic interval datasets},
elocation-id = {2025.03.21.644629},
year = {2025},
doi = {10.1101/2025.03.21.644629},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/03/25/2025.03.21.644629},
eprint = {https://www.biorxiv.org/content/early/2025/03/25/2025.03.21.644629.full.pdf},
journal = {bioRxiv}
}
Example
Discovering genomics data in VCF files
import polars_bio as pb
import polars as pl
pl.Config(fmt_str_lengths=1000, tbl_width_chars=1000)
pl.Config.set_tbl_cols(100)
vcf_1 = "gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz"
pb.describe_vcf(vcf_1).filter(pl.col("description").str.contains(r"Latino.* allele frequency"))
shape: (3, 3)
┌───────────┬───────┬────────────────────────────────────────────────────┐
│ name      ┆ type  ┆ description                                        │
│ ---       ┆ ---   ┆ ---                                                │
│ str       ┆ str   ┆ str                                                │
╞═══════════╪═══════╪════════════════════════════════════════════════════╡
│ AF_amr    ┆ Float ┆ Latino allele frequency (biallelic sites only).    │
│ AF_amr_XY ┆ Float ┆ Latino XY allele frequency (biallelic sites only). │
│ AF_amr_XX ┆ Float ┆ Latino XX allele frequency (biallelic sites only). │
└───────────┴───────┴────────────────────────────────────────────────────┘
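The filter above keeps the rows whose description matches the regular expression `Latino.* allele frequency`. As a minimal sketch of what that pattern accepts, the snippet below applies the same pattern with Python's standard `re` module to the three descriptions from the output, plus one made-up string that should not match. Polars evaluates the pattern with its own regex engine, so this illustrates the pattern only, not the library's internals.

```python
import re

# Same pattern as in the str.contains() call above.
PATTERN = re.compile(r"Latino.* allele frequency")

descriptions = [
    "Latino allele frequency (biallelic sites only).",
    "Latino XY allele frequency (biallelic sites only).",
    "Latino XX allele frequency (biallelic sites only).",
    # Hypothetical extra description, added only to show a non-match.
    "Number of Latino alleles",
]

# re.search mirrors "contains": the pattern may match anywhere in the string.
matching = [d for d in descriptions if PATTERN.search(d)]

for d in matching:
    print(d)
```

Because `.*` may match an empty string, the plain "Latino allele frequency" description is kept alongside the XY and XX variants.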
Interactive querying of genomics data
pb.register_vcf(vcf_1, "gnomad",
info_fields=['AF_amr'])
query = """
SELECT
chrom,
start,
end,
alt,
array_element(af_amr,1) AS af_amr
FROM gnomad
WHERE
filter = 'HIGH_NCR'
AND
alt = '<DUP>'
"""
pb.sql(f"{query} LIMIT 3").collect()
shape: (3, 5)
┌───────┬────────┬────────┬───────┬──────────┐
│ chrom ┆ start  ┆ end    ┆ alt   ┆ af_amr   │
│ ---   ┆ ---    ┆ ---    ┆ ---   ┆ ---      │
│ str   ┆ u32    ┆ u32    ┆ str   ┆ f32      │
╞═══════╪════════╪════════╪═══════╪══════════╡
│ chr1  ┆ 10000  ┆ 295666 ┆ <DUP> ┆ 0.000293 │
│ chr1  ┆ 138000 ┆ 144000 ┆ <DUP> ┆ 0.000166 │
│ chr1  ┆ 160500 ┆ 172100 ┆ <DUP> ┆ 0.002639 │
└───────┴────────┴────────┴───────┴──────────┘
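In the query above, `array_element(af_amr, 1)` picks the first value of the list-valued `af_amr` field; DataFusion's `array_element` counts positions from 1 rather than 0. The helper below is a hypothetical pure-Python stand-in that mimics that indexing for positive positions only. It is not part of polars-bio or DataFusion, and treating an out-of-range position as a missing value is an assumption of this sketch. The sample list reuses one value from the output above plus one made-up value.

```python
from typing import Optional, Sequence


def array_element(values: Sequence[float], position: int) -> Optional[float]:
    """Hypothetical stand-in for a 1-based SQL array lookup (positive positions only)."""
    if position < 1 or position > len(values):
        # Assumption of this sketch: out-of-range positions yield a missing value.
        return None
    return values[position - 1]


# First value taken from the output above; the second one is made up.
af_amr = [0.000293, 0.5]
first = array_element(af_amr, 1)
print(first)
```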
Creating a view and overlapping with a VCF file from another source
pb.register_view("v_gnomad", query)
pb.overlap("v_gnomad", "s3://gnomad-public-us-east-1/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr1.vcf.bgz", suffixes=("_1", "_2")).collect()
3rows [00:39, 13.20s/rows]
shape: (3, 13)
┌─────────┬─────────┬────────┬─────────┬───┬───────┬───────┬────────┬─────────────┐
│ chrom_1 ┆ start_1 ┆ end_1  ┆ chrom_2 ┆ … ┆ ref_2 ┆ alt_2 ┆ qual_2 ┆ filter_2    │
│ ---     ┆ ---     ┆ ---    ┆ ---     ┆   ┆ ---   ┆ ---   ┆ ---    ┆ ---         │
│ str     ┆ u32     ┆ u32    ┆ str     ┆   ┆ str   ┆ str   ┆ f64    ┆ str         │
╞═════════╪═════════╪════════╪═════════╪═══╪═══════╪═══════╪════════╪═════════════╡
│ chr1    ┆ 10000   ┆ 295666 ┆ chr1    ┆ … ┆ T     ┆ C     ┆ 0.0    ┆ AC0;AS_VQSR │
│ chr1    ┆ 11000   ┆ 51000  ┆ chr1    ┆ … ┆ T     ┆ C     ┆ 0.0    ┆ AC0;AS_VQSR │
│ chr1    ┆ 10000   ┆ 295666 ┆ chr1    ┆ … ┆ G     ┆ A     ┆ 0.0    ┆ AC0;AS_VQSR │
└─────────┴─────────┴────────┴─────────┴───┴───────┴───────┴────────┴─────────────┘
Note
The above example demonstrates how to use polars-bio to query and overlap genomics data from different sources over the network. The performance of the operations can be impacted by the available network throughput and the size of the data being processed.
pb.overlap("/tmp/gnomad.v4.1.sv.sites.vcf.gz", "/tmp/gnomad.exomes.v4.1.sites.chr1.vcf.bgz").limit(3).collect()
3rows [00:10, 3.49s/rows]
shape: (3, 16)
┌─────────┬─────────┬────────┬─────────┬───┬───────┬───────┬────────┬─────────────┐
│ chrom_1 ┆ start_1 ┆ end_1  ┆ chrom_2 ┆ … ┆ ref_2 ┆ alt_2 ┆ qual_2 ┆ filter_2    │
│ ---     ┆ ---     ┆ ---    ┆ ---     ┆   ┆ ---   ┆ ---   ┆ ---    ┆ ---         │
│ str     ┆ u32     ┆ u32    ┆ str     ┆   ┆ str   ┆ str   ┆ f64    ┆ str         │
╞═════════╪═════════╪════════╪═════════╪═══╪═══════╪═══════╪════════╪═════════════╡
│ chr1    ┆ 10000   ┆ 295666 ┆ chr1    ┆ … ┆ T     ┆ C     ┆ 0.0    ┆ AC0;AS_VQSR │
│ chr1    ┆ 11000   ┆ 51000  ┆ chr1    ┆ … ┆ T     ┆ C     ┆ 0.0    ┆ AC0;AS_VQSR │
│ chr1    ┆ 12000   ┆ 32000  ┆ chr1    ┆ … ┆ G     ┆ A     ┆ 0.0    ┆ AC0;AS_VQSR │
└─────────┴─────────┴────────┴─────────┴───┴───────┴───────┴────────┴─────────────┘
Register a table from a Polars DataFrame
You can register a table from a Polars DataFrame, e.g. for creating a custom annotation table.
df = pl.DataFrame({
"chrom": ["chr1", "chr1"],
"start": [11993, 12102],
"end": [11996, 12200],
"annotation": ["ann1", "ann2"]
})
pb.from_polars("test_annotation", df)
pb.sql("SELECT * FROM test_annotation").collect()
shape: (2, 4)
┌───────┬───────┬───────┬────────────┐
│ chrom ┆ start ┆ end   ┆ annotation │
│ ---   ┆ ---   ┆ ---   ┆ ---        │
│ str   ┆ i64   ┆ i64   ┆ str        │
╞═══════╪═══════╪═══════╪════════════╡
│ chr1  ┆ 11993 ┆ 11996 ┆ ann1       │
│ chr1  ┆ 12102 ┆ 12200 ┆ ann2       │
└───────┴───────┴───────┴────────────┘
pb.overlap("test_annotation", "s3://gnomad-public-us-east-1/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr1.vcf.bgz").limit(3).collect()
3rows [00:02, 1.40rows/s]
shape: (3, 12)
┌─────────┬─────────┬───────┬─────────┬─────────┬───────┬──────────────┬──────────────┬───────┬───────┬────────┬─────────────┐
│ chrom_1 ┆ start_1 ┆ end_1 ┆ chrom_2 ┆ start_2 ┆ end_2 ┆ annotation_1 ┆ id_2         ┆ ref_2 ┆ alt_2 ┆ qual_2 ┆ filter_2    │
│ ---     ┆ ---     ┆ ---   ┆ ---     ┆ ---     ┆ ---   ┆ ---          ┆ ---          ┆ ---   ┆ ---   ┆ ---    ┆ ---         │
│ str     ┆ i64     ┆ i64   ┆ str     ┆ u32     ┆ u32   ┆ str          ┆ str          ┆ str   ┆ str   ┆ f64    ┆ str         │
╞═════════╪═════════╪═══════╪═════════╪═════════╪═══════╪══════════════╪══════════════╪═══════╪═══════╪════════╪═════════════╡
│ chr1    ┆ 11993   ┆ 11996 ┆ chr1    ┆ 11994   ┆ 11994 ┆ ann1         ┆              ┆ T     ┆ C     ┆ 0.0    ┆ AC0;AS_VQSR │
│ chr1    ┆ 12102   ┆ 12200 ┆ chr1    ┆ 12106   ┆ 12106 ┆ ann2         ┆              ┆ T     ┆ G     ┆ 0.0    ┆ AC0;AS_VQSR │
│ chr1    ┆ 12102   ┆ 12200 ┆ chr1    ┆ 12138   ┆ 12138 ┆ ann2         ┆ rs1553119361 ┆ C     ┆ A     ┆ 0.0    ┆ AS_VQSR     │
└─────────┴─────────┴───────┴─────────┴─────────┴───────┴──────────────┴──────────────┴───────┴───────┴────────┴─────────────┘
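To make the result above easier to read, here is a minimal pure-Python sketch of what an overlap join computes: every pair of intervals on the same chromosome that share at least one position. It reuses the two annotation intervals and the three variant positions from the output, plus one made-up variant that falls outside both annotations. Treating both interval ends as inclusive is an assumption of this sketch, and the quadratic loop is only a reference for small inputs; it is not how polars-bio performs the operation.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both intervals are on the same chromosome and share a position.

    Assumption of this sketch: both ends are inclusive (closed intervals).
    """
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def naive_overlap(
    left: List[Interval], right: List[Interval]
) -> List[Tuple[Interval, Interval]]:
    """Quadratic reference join: compare every left interval with every right one."""
    return [(a, b) for a in left for b in right if overlaps(a, b)]


annotations = [("chr1", 11993, 11996), ("chr1", 12102, 12200)]
variants = [
    ("chr1", 11994, 11994),
    ("chr1", 12106, 12106),
    ("chr1", 12138, 12138),
    ("chr1", 13000, 13000),  # hypothetical variant outside both annotations
]

pairs = naive_overlap(annotations, variants)
for left_iv, right_iv in pairs:
    print(left_iv, right_iv)
```

The feature list credits interval trees (COITrees) for fast overlaps, which avoid comparing every pair the way this reference version does.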
Parallel file readers
Reading data from local files is significantly faster than reading it over the network. If performance matters, you can additionally enable multithreaded reading of the data.
1 thread
pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_site_local", thread_num=1)
pb.sql("select * from gnomad_site_local").collect().count()
2154486rows [00:10, 204011.57rows/s]
shape: (1, 8)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ chrom   ┆ start   ┆ end     ┆ id      ┆ ref     ┆ alt     ┆ qual    ┆ filter  │
│ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
2 threads
pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_site_local", thread_num=2)
pb.sql("select * from gnomad_site_local").collect().count()
2154486rows [00:06, 347012.53rows/s]
shape: (1, 8)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ chrom   ┆ start   ┆ end     ┆ id      ┆ ref     ┆ alt     ┆ qual    ┆ filter  │
│ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
4 threads
pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_site_local", thread_num=4)
pb.sql("select * from gnomad_site_local").collect().count()
2154486rows [00:03, 639138.47rows/s]
shape: (1, 8)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ chrom   ┆ start   ┆ end     ┆ id      ┆ ref     ┆ alt     ┆ qual    ┆ filter  │
│ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
6 threads
pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_site_local", thread_num=6)
pb.sql("select * from gnomad_site_local").collect().count()
2154486rows [00:02, 780228.99rows/s]
shape: (1, 8)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ chrom   ┆ start   ┆ end     ┆ id      ┆ ref     ┆ alt     ┆ qual    ┆ filter  │
│ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     ┆ u32     │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 ┆ 2154486 │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
Multithreading gives a clear improvement. With 1, 2, 4 and 6 threads, throughput rises from ~205k rows/s to ~350k, ~640k and ~780k rows/s, and processing time drops from ~10s to ~6s, ~3s and ~2s.
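The throughput figures printed by the progress bars above can be turned into speedup and parallel efficiency with a few lines of arithmetic. The sketch below uses only the numbers reported in this section, keyed by the thread counts in the headings; efficiency is defined here as speedup divided by the number of threads.

```python
# Throughput (rows/s) reported by the progress bars above, keyed by thread count.
throughput = {1: 204011.57, 2: 347012.53, 4: 639138.47, 6: 780228.99}

baseline = throughput[1]

# Speedup relative to the single-threaded run.
speedup = {n: rate / baseline for n, rate in throughput.items()}

# Parallel efficiency: how much of the ideal n-fold speedup is achieved.
efficiency = {n: speedup[n] / n for n in throughput}

for n in sorted(throughput):
    print(f"{n} thread(s): speedup {speedup[n]:.2f}x, efficiency {efficiency[n]:.0%}")
```

The speedup keeps growing with more threads while efficiency declines, so the gain per added thread shrinks for this file on this machine.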
Parallel operations
Let's prepare the data for the parallel operations.
Tip
As the previous examples show, polars-bio can efficiently read and process data in bioinformatic formats. However, for the best performance you can use a big-data-ready format such as Parquet.
pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_site_local", thread_num=4)
pb.sql("SELECT chrom, start, end FROM gnomad_site_local", streaming=True).sink_parquet("/tmp/gnomad.v4.1.sv.sites.parquet")
pb.register_vcf("/tmp/gnomad.exomes.v4.1.sites.chr1.vcf.bgz", "gnomad_exomes_local", thread_num=4)
pb.sql("SELECT chrom, start, end FROM gnomad_exomes_local", streaming=True).sink_parquet("/tmp/gnomad.exomes.v4.1.sites.chr1.parquet")
Note the streaming=True parameter, which enables streaming mode. This mode is useful when you want to process data that is too large to fit into memory.
See streaming for more details.
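The general idea behind streaming, or out-of-core, processing can be shown without any DataFrame library: consume the input in bounded batches so that only one batch is in memory at a time. The following is a conceptual sketch in plain Python. The helper names `batches` and `count_rows` are hypothetical, and this is not how polars or DataFusion implement streaming internally.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(rows: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of at most batch_size rows without loading all rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_rows(rows: Iterable[int], batch_size: int = 1000) -> int:
    """Count rows batch by batch; only one batch is held in memory at a time."""
    total = 0
    for batch in batches(rows, batch_size):
        total += len(batch)
    return total


# range() is lazy, so it stands in for a source that is larger than memory.
total = count_rows(range(2_500), batch_size=1000)
print(total)
```

Writing each batch to disk as it arrives, instead of only counting it, follows the same pattern as the query-to-Parquet step above.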
import polars as pl
pl.scan_parquet("/tmp/gnomad.v4.1.sv.sites.parquet").collect().count()
shape: (1, 3)
┌─────────┬─────────┬─────────┐
│ chrom   ┆ start   ┆ end     │
│ ---     ┆ ---     ┆ ---     │
│ u32     ┆ u32     ┆ u32     │
╞═════════╪═════════╪═════════╡
│ 2154486 ┆ 2154486 ┆ 2154486 │
└─────────┴─────────┴─────────┘
pl.scan_parquet("/tmp/gnomad.exomes.v4.1.sites.chr1.parquet").collect().count()
shape: (1, 3)
┌──────────┬──────────┬──────────┐
│ chrom    ┆ start    ┆ end      │
│ ---      ┆ ---      ┆ ---      │
│ u32      ┆ u32      ┆ u32      │
╞══════════╪══════════╪══════════╡
│ 17671166 ┆ 17671166 ┆ 17671166 │
└──────────┴──────────┴──────────┘
1 thread
python -m timeit -n 3 -r 1 -s 'import polars_bio as pb; pb.set_option("datafusion.execution.target_partitions", "1")' \
'pb.overlap("/tmp/gnomad.v4.1.sv.sites.parquet", "/tmp/gnomad.exomes.v4.1.sites.chr1.parquet", output_type="datafusion.DataFrame").count()'
3 loops, best of 1: 33.3 sec per loop
2 threads
python -m timeit -n 3 -r 1 -s 'import polars_bio as pb; pb.set_option("datafusion.execution.target_partitions", "2")' \
'pb.overlap("/tmp/gnomad.v4.1.sv.sites.parquet", "/tmp/gnomad.exomes.v4.1.sites.chr1.parquet", output_type="datafusion.DataFrame").count()'
3 loops, best of 1: 20.8 sec per loop
4 threads
python -m timeit -n 3 -r 1 -s 'import polars_bio as pb; pb.set_option("datafusion.execution.target_partitions", "4")' \
'pb.overlap("/tmp/gnomad.v4.1.sv.sites.parquet", "/tmp/gnomad.exomes.v4.1.sites.chr1.parquet", output_type="datafusion.DataFrame").count()'
3 loops, best of 1: 12.5 sec per loop
6 threads
python -m timeit -n 3 -r 1 -s 'import polars_bio as pb; pb.set_option("datafusion.execution.target_partitions", "6")' \
'pb.overlap("/tmp/gnomad.v4.1.sv.sites.parquet", "/tmp/gnomad.exomes.v4.1.sites.chr1.parquet", output_type="datafusion.DataFrame").count()'
3 loops, best of 1: 8.62 sec per loop
12 threads
python -m timeit -n 3 -r 1 -s 'import polars_bio as pb; pb.set_option("datafusion.execution.target_partitions", "12")' \
'pb.overlap("/tmp/gnomad.v4.1.sv.sites.parquet", "/tmp/gnomad.exomes.v4.1.sites.chr1.parquet", output_type="datafusion.DataFrame").count()'
3 loops, best of 1: 4.57 sec per loop
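The five commands above differ only in the `datafusion.execution.target_partitions` value. A small sketch like the following can generate them for any list of partition counts instead of copying them by hand. `build_timeit_command` is a hypothetical helper name introduced here; the paths and options are the ones used above.

```python
import shlex
from typing import List

LEFT = "/tmp/gnomad.v4.1.sv.sites.parquet"
RIGHT = "/tmp/gnomad.exomes.v4.1.sites.chr1.parquet"


def build_timeit_command(partitions: int) -> List[str]:
    """Return the argv list for one timeit run with the given partition count."""
    setup = (
        "import polars_bio as pb; "
        f'pb.set_option("datafusion.execution.target_partitions", "{partitions}")'
    )
    stmt = f'pb.overlap("{LEFT}", "{RIGHT}", output_type="datafusion.DataFrame").count()'
    return ["python", "-m", "timeit", "-n", "3", "-r", "1", "-s", setup, stmt]


commands = [build_timeit_command(n) for n in (1, 2, 4, 6, 12)]
for argv in commands:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Each argument list can also be passed to `subprocess.run(argv, check=True)` to execute the benchmark, provided polars-bio is installed and the Parquet files exist.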
Finally, let's save the result to a Parquet file and check the results.
pb.overlap("/tmp/gnomad.v4.1.sv.sites.parquet", "/tmp/gnomad.exomes.v4.1.sites.chr1.parquet", output_type="datafusion.DataFrame").write_parquet("/tmp/overlap.parquet")
pl.scan_parquet("/tmp/overlap.parquet").collect().count()
shape: (1, 6)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ chrom_1    ┆ start_1    ┆ end_1      ┆ chrom_2    ┆ start_2    ┆ end_2      │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│ u32        ┆ u32        ┆ u32        ┆ u32        ┆ u32        ┆ u32        │
╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╡
│ 2629727337 ┆ 2629727337 ┆ 2629727337 ┆ 2629727337 ┆ 2629727337 ┆ 2629727337 │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
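A quick sanity check on the row counts reported above shows how much the overlap fans out and how far it stays below a full cross join. The sketch uses only the three counts from this section. The average is taken over all rows of the structural-variant file, including rows that may have no overlap at all.

```python
# Row counts reported above.
sv_rows = 2_154_486            # /tmp/gnomad.v4.1.sv.sites.parquet
exome_rows = 17_671_166        # /tmp/gnomad.exomes.v4.1.sites.chr1.parquet
overlap_rows = 2_629_727_337   # /tmp/overlap.parquet

# Average number of overlapping exome rows per structural-variant row.
avg_matches_per_sv = overlap_rows / sv_rows

# Size of the full cross product, i.e. what comparing every pair would enumerate.
cross_product = sv_rows * exome_rows
fraction_of_cross_product = overlap_rows / cross_product

print(f"average overlaps per SV row: {avg_matches_per_sv:.1f}")
print(f"share of the full cross product: {fraction_of_cross_product:.2e}")
```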
See supported operations for more details.