Next-gen Python DataFrame operations for genomics!

polars-bio is a 🚀 blazing-fast Python DataFrame library for genomics 🧬 built on top of Apache DataFusion, Apache Arrow and polars. It is designed to be easy to use, fast and memory efficient, with a focus on genomics data.

Key Features

See quick start for the installation options.

Citing

If you use polars-bio in your work, please cite:

@article {Wiewiorka2025.03.21.644629,
    author = {Wiewiorka, Marek and Khamutou, Pavel and Zbysinski, Marek and Gambin, Tomasz},
    title = {polars-bio - fast, scalable and out-of-core operations on large genomic interval datasets},
    elocation-id = {2025.03.21.644629},
    year = {2025},
    doi = {10.1101/2025.03.21.644629},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/03/25/2025.03.21.644629},
    eprint = {https://www.biorxiv.org/content/early/2025/03/25/2025.03.21.644629.full.pdf},
    journal = {bioRxiv}
}

Performance benchmarks

Single-thread 🏃

overlap-single.png

count-overlaps-single.png

coverage-single.png

Parallel 🏃🏃

overlap-parallel.png

count-overlaps-parallel.png

coverage-parallel.png
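
For readers unfamiliar with the benchmarked operations, the following is a minimal pure-Python sketch of what overlap, count-overlaps and coverage compute on two sets of genomic intervals. It is not polars-bio's API or implementation: the function names, the `(chrom, start, end)` tuple layout and the half-open `[start, end)` coordinate convention are all assumptions made for illustration, and the quadratic loops are chosen for clarity rather than speed.

```python
from typing import List, Tuple

# Hypothetical interval layout: (chrom, start, end), half-open [start, end).
Interval = Tuple[str, int, int]


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True if a and b are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlap(left: List[Interval], right: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Return every (left, right) pair of intervals that overlap."""
    return [(l, r) for l in left for r in right if intervals_overlap(l, r)]


def count_overlaps(left: List[Interval], right: List[Interval]) -> List[int]:
    """For each left interval, count the right intervals that overlap it."""
    return [sum(intervals_overlap(l, r) for r in right) for l in left]


def coverage(left: List[Interval], right: List[Interval]) -> List[int]:
    """For each left interval, count its bases covered by at least one right interval."""
    result = []
    for l in left:
        # Clip each overlapping right interval to the bounds of the left interval.
        clipped = sorted(
            (max(l[1], r[1]), min(l[2], r[2]))
            for r in right
            if intervals_overlap(l, r)
        )
        covered = 0
        covered_up_to = l[1]  # bases before this position are already counted
        for start, end in clipped:
            start = max(start, covered_up_to)
            if end > start:
                covered += end - start
                covered_up_to = end
        result.append(covered)
    return result


if __name__ == "__main__":
    left = [("chr1", 0, 10), ("chr1", 20, 30), ("chr2", 0, 5)]
    right = [("chr1", 5, 15), ("chr1", 8, 25), ("chr2", 10, 20)]
    print(len(overlap(left, right)))
    print(count_overlaps(left, right))
    print(coverage(left, right))
```

Each function here compares every left interval with every right interval, which is only practical for small inputs; on large datasets this all-pairs comparison is the cost that dedicated interval libraries are built to avoid.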