Benchmarking DataFrame Paths in polars-bio 0.29.0
polars-bio 0.29.0 adds support for Pandas >= 3.0.0. Pandas 3.0 made PyArrow-backed data even more central: the new default string dtype uses PyArrow under the hood when it is available. We wanted to measure what that means for interval workloads in practice.
So instead of comparing different interval libraries, this benchmark compares different input and execution paths through the same polars-bio range engine:
- direct Parquet scan through Apache DataFusion
- Pandas DataFrame
- Pandas with Arrow-backed dtypes
- Polars eager DataFrame
- Polars lazy LazyFrame
The question is simple: how much overhead do you pay once data is materialized into a Python DataFrame, and how much of that gap can Arrow-backed Pandas close?
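To make "overhead" concrete, here is a minimal sketch of how per-path timings could be compared against a baseline. It is not the harness used for this benchmark: `time_paths`, the path names, and the stand-in workloads are all hypothetical, and in a real run each callable would wrap one of the polars-bio input paths listed above.

```python
import statistics
import time
from typing import Callable, Dict


def time_paths(
    paths: Dict[str, Callable[[], object]],
    baseline: str,
    repeats: int = 5,
) -> Dict[str, Dict[str, float]]:
    """Time each callable and report its median runtime relative to the baseline.

    Hypothetical helper for illustration; not part of polars-bio.
    """
    if baseline not in paths:
        raise KeyError(f"unknown baseline path: {baseline!r}")

    medians: Dict[str, float] = {}
    for name, run in paths.items():
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run()
            samples.append(time.perf_counter() - start)
        # The median is less sensitive to one-off stalls than the mean.
        medians[name] = statistics.median(samples)

    base = medians[baseline]
    return {
        name: {
            "median_s": median,
            # Ratio > 1 means the path is slower than the baseline.
            "vs_baseline": median / base if base > 0 else float("nan"),
        }
        for name, median in medians.items()
    }


# Stand-in workloads (assumptions): replace each lambda with a call that
# runs the same interval operation through one input path.
example_paths = {
    "parquet_scan": lambda: sum(range(50_000)),
    "pandas": lambda: sum(range(100_000)),
    "pandas_arrow": lambda: sum(range(75_000)),
}

for name, stats in time_paths(example_paths, baseline="parquet_scan").items():
    print(f"{name}: {stats['median_s']:.6f}s ({stats['vs_baseline']:.2f}x baseline)")
```

Reporting a ratio against the direct Parquet scan rather than raw seconds keeps the comparison focused on the cost of materializing data in Python, which is the quantity the question above asks about.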