Next-gen Python DataFrame operations for genomics!

polars-bio is a 🚀 blazing fast Python DataFrame library for genomics 🧬 built on top of Apache DataFusion, Apache Arrow and polars. It is designed to be easy to use, fast and memory efficient, with a focus on genomics data.

Key Features

  • optimized for performance and large-scale genomics datasets
  • popular genomics operations with a DataFrame API (both Pandas and polars)
  • SQL-powered bioinformatic data querying or manipulation
  • native parallel engine powered by Apache DataFusion and sequila-native
  • out-of-core/streaming processing (for data too large to fit into a computer's main memory) with Apache DataFusion and polars
  • support for streaming reads directly from cloud storage (e.g. S3, GCS), enabling processing of large-scale genomics data without materializing it in memory
  • zero-copy data exchange with Apache Arrow
  • support for bioinformatics file formats via noodles and exon
  • pre-built wheel packages for Linux, Windows and macOS (arm64 and x86_64) available on PyPI
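
To make the "popular genomics operations" concrete, here is a minimal pure-Python sketch of what an interval overlap and a count-overlaps operation compute. It is only an illustration of the semantics, not the polars-bio API or its implementation: the function names `overlap` and `count_overlaps`, the toy intervals, and the 1-based closed coordinate convention are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy intervals: (contig, start, end), assumed 1-based and closed.
reads = [("chr1", 1, 5), ("chr1", 8, 12), ("chr2", 3, 7)]
targets = [("chr1", 4, 9), ("chr2", 10, 15)]


def overlap(left, right):
    """Return all (left, right) interval pairs on the same contig
    that share at least one position."""
    # Group the right-hand intervals by contig so only same-contig pairs are compared.
    by_contig = defaultdict(list)
    for contig, start, end in right:
        by_contig[contig].append((start, end))

    pairs = []
    for contig, start, end in left:
        for r_start, r_end in by_contig[contig]:
            # Two closed intervals overlap when each starts before the other ends.
            if start <= r_end and r_start <= end:
                pairs.append(((contig, start, end), (contig, r_start, r_end)))
    return pairs


def count_overlaps(left, right):
    """For each left interval, count how many right intervals it overlaps."""
    # Naive: re-runs overlap() per interval, which is fine for a sketch.
    return [len(overlap([interval], right)) for interval in left]


print(overlap(reads, targets))
print(count_overlaps(reads, targets))
```

This naive version compares every same-contig pair, so it scales quadratically within a contig. The libraries listed above exist precisely to avoid that cost on large datasets.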

Single-thread performance 🏃

overlap-single.png

count-overlaps-single.png

Parallel performance 🏃🏃

overlap-parallel.png

count-overlaps-parallel.png