Next-gen Python DataFrame operations for genomics!
polars-bio is a blazing fast Python DataFrame library for genomics 🧬 built on top of Apache DataFusion, Apache Arrow
and polars.
It is designed to be easy to use, fast and memory efficient with a focus on genomics data.
Key Features
- optimized for performance and memory efficiency on large-scale genomics datasets, both when reading input data and when performing operations
- popular genomics operations with a DataFrame API (both pandas and polars)
- SQL-powered bioinformatic data querying or manipulation/pre-processing
- native parallel engine powered by Apache DataFusion and sequila-native
- out-of-core/streaming processing (for data too large to fit into a computer's main memory) with Apache DataFusion and polars
- support for federated and streamed reading of data from cloud storage (e.g. S3, GCS) with Apache OpenDAL, enabling processing of large-scale genomics data without materializing it in memory
- zero-copy data exchange with Apache Arrow
- bioinformatics file formats with noodles and exon
- fast overlap operations with COITrees: Cache Oblivious Interval Trees
- pre-built wheel packages for Linux, Windows and macOS (arm64 and x86_64) available on PyPI
See quick start for the installation options.
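To make the "overlap operations" item in the feature list concrete, here is a minimal pure-Python sketch of what an interval overlap computes. It is not the polars-bio API and does not use COITrees: the `Interval` type, the `overlap` function and the closed-interval convention are all assumptions made for this illustration, and the nested loop is a naive reference rather than the optimized tree-based approach.

```python
from collections import defaultdict
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical genomic interval; both ends inclusive (an assumption of this sketch)."""
    chrom: str
    start: int
    end: int


def overlap(left, right):
    """Return every (left, right) pair on the same chromosome whose ranges intersect.

    Naive O(n * m) per chromosome; an interval tree avoids scanning every candidate.
    """
    # Group the right-hand intervals by chromosome so only same-contig pairs are compared.
    by_chrom = defaultdict(list)
    for r in right:
        by_chrom[r.chrom].append(r)

    pairs = []
    for l in left:
        for r in by_chrom.get(l.chrom, []):
            # Two closed intervals intersect when each starts before the other ends.
            if l.start <= r.end and r.start <= l.end:
                pairs.append((l, r))
    return pairs


# Made-up example data, not taken from any real dataset.
reads = [Interval("chr1", 100, 200), Interval("chr1", 500, 600), Interval("chr2", 100, 200)]
genes = [Interval("chr1", 150, 300), Interval("chr1", 700, 800), Interval("chr2", 50, 100)]

result = overlap(reads, genes)
for l, r in result:
    print(l, "overlaps", r)
```

With closed coordinates, the `chr2` pair counts as overlapping because both intervals contain position 100; under a half-open convention it would not, so the coordinate system matters when comparing results across tools.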
Citing
If you use polars-bio in your work, please cite:
@article {Wiewiorka2025.03.21.644629,
author = {Wiewiorka, Marek and Khamutou, Pavel and Zbysinski, Marek and Gambin, Tomasz},
title = {polars-bio - fast, scalable and out-of-core operations on large genomic interval datasets},
elocation-id = {2025.03.21.644629},
year = {2025},
doi = {10.1101/2025.03.21.644629},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/03/25/2025.03.21.644629},
eprint = {https://www.biorxiv.org/content/early/2025/03/25/2025.03.21.644629.full.pdf},
journal = {bioRxiv}
}