🛠️ Developers guide
Architecture Overview
polars-bio is a Python library for high-performance genomic data processing. It combines a Python API layer with a Rust-based execution engine powered by Apache DataFusion, extended with bioinformatics-specific capabilities through upstream datafusion-bio-* crates.
```mermaid
flowchart TB
subgraph Python["Python API Layer"]
direction LR
io["I/O Operations<br/>(scan/read/write)"]
range["Range Operations<br/>(overlap, nearest, ...)"]
pileup["Pileup Operations<br/>(depth)"]
sql["SQL API"]
end
subgraph PyO3["PyO3 Rust Bindings"]
direction LR
py_register["py_register_table()"]
py_read["py_read_table() / py_read_sql()"]
py_pileup["py_pileup_depth()"]
end
subgraph DataFusion["Apache DataFusion Engine"]
direction LR
ctx["SessionContext"]
optimizer["Physical Optimizer"]
exec["Execution Engine"]
end
subgraph Upstream["Upstream Crates"]
direction LR
formats["datafusion-bio-formats<br/>(9 sub-crates)"]
functions["datafusion-bio-functions<br/>(ranges + pileup)"]
end
subgraph Storage["Data Sources"]
direction LR
local["Local Files"]
cloud["Cloud Storage<br/>(S3, GCS, Azure)"]
end
Python --> PyO3
PyO3 --> DataFusion
DataFusion --> Upstream
Upstream --> Storage
```
The key design principles are:
- Zero-copy data exchange between Python (Polars/Pandas) and Rust via the Arrow C Data Interface
- Predicate and projection pushdown through DataFusion's optimizer to minimize I/O
- Streaming execution for out-of-core processing of datasets larger than memory
- Parallel execution via DataFusion's partitioned execution model
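The sketch below illustrates these principles from the user's perspective. The file name is hypothetical; the operations shown (`scan_vcf`, filter/select pushdown, streaming collect) are documented in the sections that follow:

```python
import polars as pl
import polars_bio as pb

# Lazy scan: no data is read yet; the schema comes from the file header
lf = pb.scan_vcf("variants.vcf.gz")

# The filter and select are pushed down into DataFusion, so filtering and
# column skipping happen during the scan itself; streaming collect keeps
# memory bounded for larger-than-RAM inputs
result = (
    lf.filter(pl.col("chrom") == "chr1")
    .select(["chrom", "start", "end"])
    .collect(engine="streaming")
)
```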
Upstream Repositories
polars-bio relies on two upstream repository families that provide the DataFusion extensions for bioinformatics workloads. These are pinned as git dependencies in Cargo.toml.
datafusion-bio-formats
Repository: github.com/biodatageeks/datafusion-bio-formats
Provides DataFusion TableProvider implementations for bioinformatics file formats, using noodles for low-level format I/O.
| Crate | Format | Features |
|---|---|---|
| `datafusion-bio-format-bam` | BAM | Indexed reads (BAI/CSI), predicate & projection pushdown, parallel partitioning |
| `datafusion-bio-format-cram` | CRAM | Indexed reads (CRAI), predicate & projection pushdown, reference-based compression |
| `datafusion-bio-format-vcf` | VCF | Indexed reads (TBI/CSI), predicate & projection pushdown, INFO/FORMAT field extraction |
| `datafusion-bio-format-gff` | GFF3 | Indexed reads (TBI/CSI), predicate & projection pushdown, attribute field extraction |
| `datafusion-bio-format-bed` | BED | Single-threaded reads, limit pushdown |
| `datafusion-bio-format-fastq` | FASTQ | GZI-indexed parallel BGZF decoding, limit pushdown |
| `datafusion-bio-format-fasta` | FASTA | Single-threaded reads, limit pushdown |
| `datafusion-bio-format-pairs` | Pairs | Indexed reads (TBI/CSI), predicate & projection pushdown |
| `datafusion-bio-format-core` | Core | Shared utilities, coordinate system metadata, OpenDAL cloud storage integration |
datafusion-bio-functions
Repository: github.com/biodatageeks/datafusion-bio-functions
Provides DataFusion extensions for genomic range operations and pileup computation.
| Crate | Purpose | Mechanism |
|---|---|---|
| `datafusion-bio-function-ranges` | Overlap, nearest, coverage, count-overlaps | PhysicalOptimizerRule for interval join rewriting + UDTF for coverage/count-overlaps |
| `datafusion-bio-function-pileup` | Per-base read depth (pileup) | UDTF registered as `depth()` in SQL |
Dependency pinning
Both repositories are pinned to specific git revisions in Cargo.toml:
```toml
# Format crates (all pinned to the same revision)
datafusion-bio-format-vcf = { git = "https://github.com/biodatageeks/datafusion-bio-formats.git", rev = "..." }
datafusion-bio-format-bam = { git = "https://github.com/biodatageeks/datafusion-bio-formats.git", rev = "..." }
# ... (9 sub-crates total)

# Function crates
datafusion-bio-function-ranges = { git = "https://github.com/biodatageeks/datafusion-bio-functions.git", rev = "..." }
datafusion-bio-function-pileup = { git = "https://github.com/biodatageeks/datafusion-bio-functions.git", rev = "...", default-features = false }
```
Note
datafusion-bio-function-pileup uses default-features = false to avoid a duplicate BAM dependency conflict with datafusion-bio-format-bam.
I/O Pipeline & Polars IO Plugin
polars-bio uses the Polars IO plugin mechanism (register_io_source) to bridge between DataFusion's streaming execution and Polars' lazy evaluation. This section documents the end-to-end I/O pipeline and the pushdown optimizations applied at each stage.
Overview
Every scan_*() / read_*() call flows through the same pipeline:
```mermaid
flowchart TB
subgraph User["User API"]
scan["scan_bam() / scan_vcf() / scan_gff() / ..."]
end
subgraph ReadFile["_read_file()"]
register["1. py_register_table()<br/>Register DataFusion TableProvider"]
schema["2. py_get_table_schema()<br/>Extract Arrow schema + metadata"]
extract["3. extract_all_schema_metadata()<br/>Parse bio.* metadata"]
lazy["4. _lazy_scan()<br/>Create Polars LazyFrame via IO plugin"]
meta["5. set_coordinate_system() +<br/>set_source_metadata()"]
end
subgraph IOPlugin["Polars IO Plugin (_overlap_source callback)"]
direction TB
predicate["Predicate pushdown"]
projection["Projection pushdown"]
limit["Limit pushdown"]
stream["execute_stream()<br/>Yield Arrow batches"]
safety["Safety net:<br/>client-side filter/select"]
end
subgraph DataFusion["DataFusion Engine"]
tp["TableProvider<br/>(BamTableProvider, VcfTableProvider, ...)"]
idx["Index-based partition pruning<br/>(BAI/CSI/TBI/CRAI)"]
record["Record-level filtering"]
proj["Column-level skip"]
end
scan --> ReadFile
ReadFile --> IOPlugin
IOPlugin --> DataFusion
```
Step-by-step
1. **Registration** – `py_register_table()` (Rust/PyO3) creates a DataFusion `TableProvider` for the given format (e.g., `BamTableProvider`, `VcfTableProvider`). The provider handles file opening, index discovery, schema inference, and partition planning.
2. **Schema extraction** – `py_get_table_schema()` returns the Arrow schema with all `bio.*` metadata without reading any data. This is used to derive the Polars schema for the IO plugin.
3. **Metadata parsing** – `extract_all_schema_metadata()` decodes Arrow metadata into format-specific structures (VCF headers, sample names, etc.). See Metadata Flow for details.
4. **IO plugin registration** – `_lazy_scan()` calls Polars' `register_io_source()`, providing:
    - A schema (dict of column name → Polars dtype, derived from the Arrow schema)
    - A callback function (`_overlap_source`) that receives pushdown hints from Polars' query optimizer
5. **Metadata attachment** – Coordinate system and source metadata are set on the returned LazyFrame via `config_meta`.
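For orientation, here is a minimal, standalone sketch of the Polars IO plugin mechanism itself (an unstable Polars API; this toy source is *not* polars-bio's actual callback, and applies all pushdown hints client-side):

```python
from typing import Iterator

import polars as pl
from polars.io.plugins import register_io_source


def my_source(
    with_columns: list[str] | None,  # projection pushdown hint
    predicate: pl.Expr | None,       # predicate pushdown hint
    n_rows: int | None,              # limit pushdown hint
    batch_size: int | None,          # batch size hint
) -> Iterator[pl.DataFrame]:
    df = pl.DataFrame({"chrom": ["chr1", "chr2"], "start": [100, 200]})
    if predicate is not None:
        df = df.filter(predicate)    # toy example: apply hints client-side
    if with_columns is not None:
        df = df.select(with_columns)
    if n_rows is not None:
        df = df.head(n_rows)
    yield df


lf = register_io_source(my_source, schema={"chrom": pl.String, "start": pl.Int64})
print(lf.filter(pl.col("chrom") == "chr1").collect())
```

polars-bio's `_overlap_source` follows the same contract but forwards the hints into DataFusion instead of applying them in Python, as described below.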
The IO plugin callback (_overlap_source)
When Polars collects the LazyFrame (.collect() or streaming), it calls the registered _overlap_source callback with four parameters provided by the Polars engine:
```python
def _overlap_source(
    with_columns: pl.Expr | None,  # Requested columns (projection)
    predicate: pl.Expr | None,     # Filter expression
    n_rows: int | None,            # Row limit
    _batch_size: int | None,       # Batch size hint
) -> Iterator[pl.DataFrame]:
```
The callback implements three pushdown optimizations, falling back to client-side evaluation when pushdown fails:
Predicate pushdown
Predicate pushdown translates Polars filter expressions into DataFusion predicates so that filtering happens inside the DataFusion execution engine, before data reaches Python.
The translation has two levels:
**Level 1: Index-based region pruning** – When a genomic region filter is present (e.g., `pl.col("chrom") == "chr1"`) and an index file exists, the upstream TableProvider uses the index to skip irrelevant file regions entirely. This is handled automatically by DataFusion's plan optimization: the predicate in the DataFusion `DataFrame.filter()` is propagated to the `TableProvider::scan()` method, which performs partition pruning.

**Level 2: Record-level filtering** – For non-genomic predicates (e.g., `pl.col("mapping_quality") >= 30`), the filter is evaluated per-record during parsing, before Arrow RecordBatch construction.
The translation pipeline is:
```text
Polars Expr (pl.col("chrom") == "chr1")
  ↓ translate_predicate()
DataFusion Expr (col("chrom") == lit("chr1"))
  ↓ datafusion_expr_to_sql()
SQL string ("chrom = 'chr1'")
  ↓ query_df.parse_sql_expr()
Binding-compatible DataFusion Expr (same PyO3 compilation unit)
  ↓ query_df.filter()
DataFusion DataFrame with predicate
```
Why the SQL roundtrip?
The polars_bio Rust crate and the pip install datafusion Python package are separate PyO3 compilation units. Their Expr types are not interchangeable at the Python level. The bridge goes through parse_sql_expr(), which creates an Expr from the same compilation unit as the DataFrame, avoiding type mismatches.
**Format-aware validation** – The predicate translator knows the column types for each format and validates operator compatibility:

| Column type | Supported operators |
|---|---|
| String (`chrom`, `name`, ...) | `==`, `!=`, `IN`, `NOT IN` |
| Numeric (`start`, `end`, `mapping_quality`, ...) | `==`, `!=`, `<`, `<=`, `>`, `>=`, `BETWEEN`, `IN`, `NOT IN` |
| Unknown (BAM tags, VCF INFO fields, GFF attributes) | All – DataFusion type-checks at execution |
| Compound | `AND` combinations of the above |
**Safety net** – If translation fails (unsupported expression, type error), the callback logs a warning and applies the filter client-side on each batch after streaming:
```python
if predicate is not None and not datafusion_predicate_applied:
    out = out.filter(predicate)  # client-side fallback
```
Projection pushdown
When Polars' optimizer determines only a subset of columns is needed (e.g., .select(["name", "chrom"])), the with_columns parameter carries the requested column names. The callback pushes this down to DataFusion:
```python
select_exprs = [query_df.parse_sql_expr(f'"{c}"') for c in requested_cols]
query_df = query_df.select(*select_exprs)
```
This propagates through the DataFusion plan to the TableProvider, where parsing-level projection skips fields during record decoding:
- BAM: Unprojected fields (sequence, quality scores, CIGAR, tags) are not decoded
- VCF: Unprojected INFO/FORMAT fields are not parsed
- GFF: Attribute fields are only parsed if requested (GFF re-registers the table with `attr_fields` when specific attribute columns are projected)
**Safety net** – If DataFusion projection fails, a client-side select is applied:
```python
if with_columns is not None and not datafusion_projection_applied:
    out = out.select(with_columns)  # client-side fallback
```
Limit pushdown
When a .head(n) or .limit(n) is applied, Polars passes n_rows to the callback. This is pushed to DataFusion:
DataFusion propagates the limit to the TableProvider, which stops reading after enough records are produced. An additional client-side limit guard tracks remaining rows across batches.
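A minimal sketch of this step inside the callback, assuming the internal DataFusion DataFrame binding exposes the standard DataFusion-style `limit()` method:

```python
# Inside the IO plugin callback (sketch): push the row limit into the
# DataFusion plan; a client-side guard still tracks rows across batches.
if n_rows is not None:
    query_df = query_df.limit(n_rows)
```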
Streaming execution
After all pushdowns are applied, the callback executes the query as a stream of Arrow RecordBatches:
```python
df_stream = query_df.execute_stream()
for r in df_stream:
    out = pl.DataFrame(r.to_pyarrow())
    # Apply client-side fallbacks if needed
    yield out
```
Each yielded pl.DataFrame batch becomes part of the Polars LazyFrame execution graph. Polars can then apply further operations (joins, aggregations, sinks) on the stream.
GFF special handling
GFF is the only format where projection affects table registration. GFF files have a semi-structured attributes column containing key=value pairs. When specific attribute columns are projected (e.g., gene_id, transcript_id), the callback re-registers the table with those attr_fields so the TableProvider parses them into individual columns:
```python
if input_format == InputFormat.Gff and attr_fields:
    gff_opts = GffReadOptions(attr_fields=attr_fields, ...)
    ropts = ReadOptions(gff_read_options=gff_opts)  # assumed wrapper, mirroring ReadOptions usage elsewhere
    table_obj = py_register_table(_ctx, file_path, None, InputFormat.Gff, ropts)
    table_to_query = table_obj.name
```
Polars Integration
polars-bio leverages deep integration with Polars through the Arrow C Data Interface, enabling high-performance zero-copy data exchange between Polars LazyFrames and the Rust-based genomic range operations engine.
Architecture
```mermaid
flowchart LR
subgraph Python["Python Layer"]
LF["Polars LazyFrame"]
DF["Polars DataFrame"]
end
subgraph FFI["Arrow C Data Interface"]
stream["Arrow C Stream<br/>(__arrow_c_stream__)"]
end
subgraph Rust["Rust Layer (polars-bio)"]
reader["ArrowArrayStreamReader"]
datafusion["DataFusion Engine"]
range_ops["Range Operations<br/>(overlap, nearest, etc.)"]
end
LF --> |"ArrowStreamExportable"| stream
DF --> |"to_arrow()"| stream
stream --> |"Zero-copy FFI"| reader
reader --> datafusion
datafusion --> range_ops
```
How It Works
When you pass a Polars LazyFrame to range operations like overlap() or nearest():
- **Stream Export**: The LazyFrame exports itself as an Arrow C Stream via `collect_batches(lazy=True)._inner.__arrow_c_stream__()` (Polars >= 1.37.0)
- **Zero-Copy Transfer**: The stream pointer is passed directly to Rust – no data copying or Python object conversion
- **GIL-Free Execution**: Once the stream is exported, all data processing happens in Rust without holding Python's GIL
- **Streaming Execution**: Data flows through DataFusion's streaming engine, processing batches on-demand
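The same protocol can be exercised from pure Python. A minimal sketch using pyarrow (>= 15) as the consumer instead of polars-bio's Rust `ArrowArrayStreamReader`:

```python
import polars as pl
import pyarrow as pa

df = pl.DataFrame({"chrom": ["chr1"], "start": [100], "end": [200]})

# Any object implementing __arrow_c_stream__ can be consumed without copying
reader = pa.RecordBatchReader.from_stream(df)
for batch in reader:
    print(batch.num_rows)
```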
Performance Benefits
| Aspect | Previous Approach | Arrow C Stream |
|---|---|---|
| GIL acquisition | Per batch | Once at export |
| Data conversion | Polars -> PyArrow -> Arrow | Direct FFI |
| Memory overhead | Python iterator objects | None |
| Batch processing | Python `__next__()` calls | Native Rust iteration |
Requirements
- Polars >= 1.37.0 (required for `ArrowStreamExportable`)
Batch Size Configuration
polars-bio automatically synchronizes the batch size between Polars streaming and DataFusion execution. When you set datafusion.execution.batch_size, Polars' collect_batches() will use the same chunk size:
```python
import polars_bio as pb

# Set batch size for both Polars and DataFusion
pb.set_option("datafusion.execution.batch_size", "8192")

# Now LazyFrame streaming uses 8192-row batches
# This ensures consistent memory usage and processing patterns
```
| Setting | Effect |
|---|---|
| `datafusion.execution.batch_size` | Controls batch size for both Polars streaming export and DataFusion processing |
| Default | 8192 rows (synchronized between Polars and DataFusion) |
| `"65536"` | Larger batches for high-throughput scenarios |
Tip
Matching batch sizes between Polars and DataFusion improves cache locality and reduces memory fragmentation when processing large datasets.
Example
```python
import polars as pl
import polars_bio as pb

# Create LazyFrames from large files
lf1 = pl.scan_parquet("variants.parquet")
lf2 = pl.scan_parquet("regions.parquet")

# Set coordinate system metadata
lf1 = lf1.config_meta.set(coordinate_system_zero_based=True)
lf2 = lf2.config_meta.set(coordinate_system_zero_based=True)

# Range operation uses Arrow C Stream for efficient data transfer
result = pb.overlap(
    lf1, lf2,
    cols1=["chrom", "start", "end"],
    cols2=["chrom", "start", "end"],
    output_type="polars.LazyFrame",
)

# Execute with Polars streaming engine
result.collect(engine="streaming")
```
Metadata Flow: Polars ↔ Arrow
polars-bio maintains rich metadata (coordinate system, source format, file headers) across the boundary between Polars DataFrames and the Arrow/DataFusion layer. This section describes how metadata is stored, transferred, and preserved.
Overview
```mermaid
flowchart TB
subgraph Read["Read Path (file → Polars)"]
direction TB
file["Genomic File<br/>(BAM, VCF, GFF, ...)"]
tp["TableProvider<br/>(Rust, datafusion-bio-formats)"]
arrow_schema["Arrow Schema<br/>with bio.* metadata"]
extract["extract_all_schema_metadata()<br/>(Python)"]
polars_meta["Polars config_meta<br/>(coordinate_system, source_format,<br/>source_path, source_header)"]
file --> tp
tp --> |"py_get_table_schema()"| arrow_schema
arrow_schema --> extract
extract --> |"set_coordinate_system()<br/>set_source_metadata()"| polars_meta
end
subgraph Write["Write Path (Polars → file)"]
direction TB
polars_meta_w["Polars config_meta<br/>(source_header JSON)"]
arrow_schema_w["Arrow Schema<br/>with VCF field metadata"]
tp_w["TableProvider::insert_into()<br/>(Rust)"]
file_w["Output File"]
polars_meta_w --> |"apply_vcf_metadata_to_schema()"| arrow_schema_w
arrow_schema_w --> tp_w
tp_w --> file_w
end
```
Arrow-side metadata (DataFusion/Rust)
When a TableProvider registers a file, the upstream datafusion-bio-formats crates embed format-specific metadata into the Arrow schema using bio.*-prefixed keys:
| Key prefix | Format | Examples |
|---|---|---|
| `bio.coordinate_system_zero_based` | All | `"true"` or `"false"` |
| `bio.vcf.file_format` | VCF | `"VCFv4.2"` |
| `bio.vcf.contigs` | VCF | JSON array of contig definitions |
| `bio.vcf.filters` | VCF | JSON array of filter definitions |
| `bio.vcf.field.*` | VCF | Per-field metadata (field_type, number, type, description) |
| `bio.bam.*` | BAM/SAM/CRAM | Format-specific metadata |
| `bio.gff.*` | GFF | Format-specific metadata |
| `bio.fastq.*` | FASTQ | Format-specific metadata |
The Arrow schema is extracted without materializing any data via py_get_table_schema(), which calls table.schema().as_arrow() in Rust.
Polars-side metadata (Python)
On the Python side, metadata is stored using the polars-config-meta library, which attaches key-value pairs to Polars DataFrames and LazyFrames via config_meta:
| Key | Type | Description |
|---|---|---|
| `coordinate_system_zero_based` | `bool` | `True` = 0-based half-open, `False` = 1-based closed |
| `source_format` | `str` | Format identifier (`"vcf"`, `"bam"`, `"gff"`, ...) |
| `source_path` | `str` | Original file path |
| `source_header` | `str` | JSON-encoded format-specific header data |
For Pandas DataFrames, the same keys are stored in df.attrs.
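Reading the attached metadata back is a one-liner; a sketch assuming polars-config-meta's `get_metadata()` accessor and the key names from the table above:

```python
import polars_bio as pb

lf = pb.scan_vcf("variants.vcf")
meta = lf.config_meta.get_metadata()
print(meta["source_format"])                # "vcf"
print(meta["coordinate_system_zero_based"]) # False unless overridden at scan time
```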
Read path: Arrow → Polars
When scan_*() or read_*() is called, the I/O pipeline transfers metadata in three steps:
1. **Register table** – `py_register_table()` creates a DataFusion `TableProvider` that embeds `bio.*` metadata in the Arrow schema
2. **Extract schema** – `py_get_table_schema()` returns the PyArrow schema; `extract_all_schema_metadata()` parses all `bio.*` keys into a structured dict with format-specific parsers (VCF fields, sample names, contigs, etc.)
3. **Set Polars metadata** – `set_coordinate_system()` and `set_source_metadata()` attach the parsed metadata to the LazyFrame via `config_meta`
```python
# Internally, _read_file() does:
table = py_register_table(ctx, path, None, input_format, read_options)
schema = py_get_table_schema(ctx, table.name)        # Arrow schema with bio.* metadata
full_metadata = extract_all_schema_metadata(schema)  # Parse into structured dict
lf = _lazy_scan(schema, ...)                         # Create LazyFrame
set_coordinate_system(lf, zero_based)                # Attach to config_meta
set_source_metadata(lf, format="vcf", path=path, header=header_metadata)
```
Write path: Polars → Arrow
When writing files (e.g., write_vcf(), sink_bam()), metadata flows back from the Polars DataFrame to the Arrow schema:
1. **Extract Polars metadata** – `get_metadata()` reads `source_header` JSON from `config_meta`
2. **Apply to Arrow schema** – `apply_vcf_metadata_to_schema()` (Rust) converts the JSON metadata back into Arrow field-level metadata (`bio.vcf.field.*` keys)
3. **Write with headers** – The `TableProvider::insert_into()` implementation uses the schema metadata to write proper file headers (VCF INFO/FORMAT definitions, BAM `@SQ`/`@RG`/`@PG` records, etc.)
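A hypothetical round trip tying the two paths together. `write_vcf()` is named above, but its exact signature is an assumption here, for illustration only:

```python
import polars as pl
import polars_bio as pb

lf = pb.scan_vcf("variants.vcf")      # header captured in config_meta on read
out = lf.filter(pl.col("qual") > 30)  # metadata preserved through the operation
pb.write_vcf(out, "filtered.vcf")     # hypothetical call: header re-applied on write
```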
Metadata preservation
Polars metadata set via config_meta is preserved through most Polars operations:
```python
lf = pb.scan_vcf("variants.vcf")
filtered = lf.filter(pl.col("qual") > 30)  # metadata preserved
selected = lf.select(["chrom", "start"])   # metadata preserved
limited = lf.head(100)                     # metadata preserved

# All share the same metadata
assert pb.get_metadata(lf)["format"] == pb.get_metadata(filtered)["format"]  # "vcf"
```
Registered table metadata
For registered DataFusion tables, coordinate system metadata can also be read directly from the Arrow schema:
```python
import polars_bio as pb
from polars_bio.context import ctx

pb.register_vcf("variants.vcf", name="my_variants")
df = ctx.table("my_variants")
schema = df.schema()
print(schema.metadata)  # {b'bio.coordinate_system_zero_based': b'false', ...}
```
Coordinate System Internals
polars-bio supports both 0-based half-open [start, end) and 1-based closed [start, end] coordinate systems. The user-facing API is documented in Features β Coordinate systems support. This section describes how the coordinate system propagates through the internal pipeline and affects query semantics.
End-to-end flow
```mermaid
flowchart LR
subgraph IO["I/O Layer"]
scan["scan_*(use_zero_based=...)"]
end
subgraph Meta["Metadata Layer"]
config["config_meta<br/>coordinate_system_zero_based"]
validate["validate_coordinate_systems()"]
end
subgraph Python["Range Operations (Python)"]
filter_op["FilterOp.Strict<br/>(0-based)<br/>FilterOp.Weak<br/>(1-based)"]
end
subgraph Rust["Range Operations (Rust)"]
upstream["RangesFilterOp"]
sql_sign["SQL join condition:<br/>Strict → '>' (exclusive end)<br/>Weak → '>=' (inclusive end)"]
end
scan --> config
config --> validate
validate --> filter_op
filter_op --> |"PyO3"| upstream
upstream --> sql_sign
```
How it works
1. **I/O sets metadata** – When `scan_*()` or `read_*()` is called, `set_coordinate_system(lf, zero_based)` stores the coordinate system in `config_meta`. The `use_zero_based` parameter defaults to the global session setting `datafusion.bio.coordinate_system_zero_based`.

2. **Validation at operation time** – Before every range operation (`overlap`, `nearest`, `count_overlaps`, `coverage`, `merge`), `validate_coordinate_systems()` reads `config_meta` from both input DataFrames and checks they match. In strict mode (`coordinate_system_check=true`), missing metadata raises `MissingCoordinateSystemError`. In lenient mode (default), it falls back to the global setting with a warning.

3. **Coordinate system → FilterOp** – The validated boolean is mapped to a `FilterOp` enum:

    | zero_based | FilterOp | Interval semantics |
    |---|---|---|
    | `True` | `Strict` | `[start, end)` – end is exclusive |
    | `False` | `Weak` | `[start, end]` – end is inclusive |

4. **FilterOp → SQL join condition** – In Rust, `FilterOp` controls the comparison operator in the interval overlap predicate:
    - Strict (0-based): `a.start < b.end AND a.end > b.start` – strict inequalities because end positions are exclusive
    - Weak (1-based): `a.start <= b.end AND a.end >= b.start` – weak inequalities because end positions are inclusive
Storage locations
The coordinate system is stored in different places depending on the context:
| Context | Storage | Key |
|---|---|---|
| Polars DataFrame/LazyFrame | `config_meta` (polars-config-meta) | `coordinate_system_zero_based` |
| Pandas DataFrame | `df.attrs` | `coordinate_system_zero_based` |
| DataFusion registered table | Arrow schema metadata | `bio.coordinate_system_zero_based` |
| Session default | DataFusion config | `datafusion.bio.coordinate_system_zero_based` |
Session parameters
| Parameter | Default | Description |
|---|---|---|
| `datafusion.bio.coordinate_system_zero_based` | `"false"` (1-based) | Global default applied when `use_zero_based` is not specified |
| `datafusion.bio.coordinate_system_check` | `"false"` (lenient) | `"true"` = raise error on missing metadata; `"false"` = fall back to global default with warning |
Format defaults
Different file formats have conventional coordinate systems. polars-bio uses the global default (1-based) for all formats, but users can override at I/O time with use_zero_based=True:
| Format | Convention | polars-bio default |
|---|---|---|
| VCF | 1-based | 1-based |
| GFF | 1-based | 1-based |
| SAM/BAM/CRAM | 1-based | 1-based |
| BED | 0-based | 1-based (override with use_zero_based=True) |
| Pairs | 0-based or 1-based | 1-based |
Warning
BED format conventionally uses 0-based coordinates. If your pipeline mixes BED files with VCF/BAM data, be sure to read BED files with `use_zero_based=True` or set the global default accordingly; otherwise, range operations will produce incorrect results.
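For example, assuming `scan_bed` follows the `scan_*()` pattern used throughout this guide:

```python
import polars_bio as pb

# Read BED with its conventional 0-based half-open coordinates
bed = pb.scan_bed("regions.bed", use_zero_based=True)
```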
Range Operation Algorithm
polars-bio implements a set of interval operations on genomic ranges, including binary operations (overlap, nearest, count-overlaps, coverage, subtract) and unary operations (merge, cluster, complement). The binary operations share a very similar algorithmic structure, shown in the diagram below. The unary operations take a single set of intervals and produce transformed output: merged intervals, cluster assignments, or gap intervals, respectively.
```mermaid
flowchart TB
%% Define header node
H["Interval operation"]
%% Define DataFrame nodes
I0["left DataFrame"]
I1["right DataFrame"]
style I0 stroke-dasharray: 5 5, stroke-width: 1
%% Draw edges with labels
H -->|probe /streaming/ side| I0
H -->|build /search structure/ side| I1
%% Record batches under left DataFrame within a dotted box
I0 --> LeftGroup
subgraph LeftGroup["Record Batches"]
direction TB
LB0["Batch 1"]
LB1["Batch 2"]
LB2["Batch 3"]
end
style LeftGroup stroke-dasharray: 5 5, stroke-width: 1
%% Record batches under right DataFrame within a dotted box
I1 --> RightGroup
subgraph RightGroup["Record Batches"]
direction TB
RB0["Batch 1"]
RB1["Batch 2"]
RB2["Batch 3"]
end
```
The basic concept is that each operation consists of two sides: the probe side and the build side. The probe side is streamed, while the build side is materialized as a search data structure. For the generic overlap operation the search structure can be changed with the `algorithm` parameter; the other operations always use Cache Oblivious Interval Trees, since according to the benchmark COITrees outperforms the other data structures. For the nearest operation, an additional sorted list of intervals is used to find the closest intervals when no overlap exists.
Note
Available search structure implementations for overlap operation:
- COITrees
- IITree
- AVL-tree
- rust-lapper
- superintervals – available since `polars-bio` version `0.12.0`
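For example, selecting the build-side search structure for the generic overlap operation. A sketch: `algorithm` is the parameter named above, but the exact value strings it accepts are an assumption here:

```python
import polars as pl
import polars_bio as pb

df1 = pl.DataFrame({"chrom": ["chr1"], "start": [100], "end": [200]})
df2 = pl.DataFrame({"chrom": ["chr1"], "start": [150], "end": [250]})

# Choose the search data structure for the build side
result = pb.overlap(
    df1, df2,
    cols1=["chrom", "start", "end"],
    cols2=["chrom", "start", "end"],
    algorithm="superintervals",  # value string assumed for illustration
)
```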
Once the build-side data structure is ready, records from the probe side, organized as record batches, are processed against the search structure. Each record batch can be processed independently. Search structure nodes contain identifiers of the rows from the build side, which are used to construct the new records returned as the result of the operation.
Out-of-core (streaming) processing
This algorithm allows you to process your results without requiring all your data to be in memory at the same time. In particular, the probe side can be streamed from a file stored locally or on a cloud object storage, while the build side needs to be fully materialized in memory. In real applications, the probe side is usually a large file with genomic intervals, while the build side is a smaller file with annotations or other genomic features. This allows you to process large genomic datasets without running out of memory.
Note
- In this sense, the order of the sides is important, as the probe side is streamed and processed in batches, while the build side is fully materialized in memory.
- The smaller the build side and the larger the number of overlaps, the higher the memory-efficiency gain. For instance, comparing the real `8-7` (10^7 vs. 1.2×10^6) and synthetic (10^7 vs. 10^7) benchmark datasets, we benefit more from streaming mode in the former.
Parallelization
In the current implementation, the probe side can be processed in parallel using multiple threads on implicitly or explicitly partitioned inputs (see parallel engine). The build side is predominantly single-threaded, with the notable exception of reading BGZF-compressed files or partitioned Parquet/CSV inputs, which can be parallelized.
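Probe-side parallelism can be tuned through DataFusion's standard partitioning setting, using `set_option` as shown earlier in this guide:

```python
import polars_bio as pb

# DataFusion's knob for the number of plan partitions,
# which bounds probe-side parallelism
pb.set_option("datafusion.execution.target_partitions", "8")
```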
Comparison with Existing Tools
The table below compares polars-bio with other popular Python libraries for genomic ranges operations.
| Feature/Library | polars-bio | Bioframe | PyRanges0 | PyRanges1 | pybedtools | GenomicRanges |
|---|---|---|---|---|---|---|
| out-of-core processing | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| parallel processing | ✅ | ❌ | ✅1 | ❌ | ❌ | ❌ |
| vectorized execution engine | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| cloud object storage support | ✅ | ✅/❌2 | ❌ | ❌ | ❌ | ❌ |
| Pandas/Polars DataFrame support | ✅/✅ | ✅/❌ | ✅/❌3 | ✅/❌4 | ❌/❌ | ✅/❌ |
Note
1 PyRanges0 supports parallel processing with Ray, but it does not bring any performance benefits over single-threaded execution and it is not recommended.
2 Some input functions, such as `read_table`, support cloud object storage
3 Only export/import with data copying is supported
4 RangeFrame class extends Pandas DataFrame
DataFusion Extension Points
polars-bio uses the following Apache DataFusion extension points:
PhysicalOptimizerRule for interval join rewriting
DefaultPhysicalPlanner and PhysicalOptimizerRule are used for detecting and rewriting generic interval join operations (i.e. overlap and nearest) with optimized execution strategies. This is implemented in datafusion-bio-function-ranges which exposes optimized interval join operations for Apache DataFusion with both SQL and DataFrame APIs.
TableProvider + UDTF for specialized operations
TableProvider and User-Defined Table Function mechanisms are used for implementing specialized operations:
- coverage and count-overlaps: Implemented as UDTFs in `datafusion-bio-function-ranges`
- depth (pileup): Implemented as a UDTF in `datafusion-bio-function-pileup`, registered as `depth()` in SQL
```python
import polars_bio as pb

# Coverage via UDTF
result = pb.coverage(df1, df2)

# Depth via SQL UDTF
result = pb.sql("SELECT * FROM depth('alignments.bam')").collect()
```
Predicate & Projection Pushdown Examples
This section provides concrete usage examples and execution plan inspection for the pushdown optimizations described in the I/O Pipeline section above.
Record-level filter pushdown
Beyond index-based region queries, all formats support record-level predicate evaluation. Filters on columns like mapping_quality, flag, score, or strand are evaluated as each record is read, filtering early before Arrow RecordBatch construction.
This works with or without an index file:
```python
import polars as pl
import polars_bio as pb

# No index needed -- filters applied per-record during scan
df = (
    pb.scan_bam("alignments.bam")
    .filter((pl.col("mapping_quality") >= 30) & ((pl.col("flag") & 4) == 0))
    .collect()
)

# Combine genomic region (uses index) with record filter (applied per-record)
df = (
    pb.scan_bam("alignments.bam")
    .filter(
        (pl.col("chrom") == "chr1")
        & (pl.col("start") >= 1000000)
        & (pl.col("mapping_quality") >= 30)
    )
    .collect()
)
```
Projection pushdown
BAM, CRAM, VCF, and Pairs formats support parsing-level projection pushdown. When you select a subset of columns, unprojected fields are skipped entirely during record parsing β no string formatting, sequence decoding, map lookups, or memory allocation for those fields. This can significantly reduce I/O and CPU time, especially for wide schemas like BAM (11+ columns) where you only need a few fields.
Projection pushdown is enabled by default (projection_pushdown=True) on all scan_*/read_* calls and range operations. To disable it, pass projection_pushdown=False.
```python
import polars_bio as pb

# Only name and chrom are parsed from each BAM record
# (projection pushdown is on by default)
df = (
    pb.scan_bam("alignments.bam")
    .select(["name", "chrom"])
    .collect()
)

# Works the same for CRAM and VCF
df = pb.scan_cram("alignments.cram").select(["name", "chrom"]).collect()

# Works with SQL too -- only referenced columns are parsed
pb.register_vcf("variants.vcf.gz", "variants")
result = pb.sql("SELECT chrom, start FROM variants").collect()
```
Inspecting the execution plan
You can verify pushdown is active by inspecting the physical execution plan:
```python
from polars_bio.context import ctx
from polars_bio.polars_bio import (
    InputFormat, ReadOptions, BamReadOptions,
    py_register_table, py_read_table,
)

read_options = ReadOptions(bam_read_options=BamReadOptions())
table = py_register_table(ctx, "alignments.bam", None, InputFormat.Bam, read_options)
df = py_read_table(ctx, table.name).select_columns("name", "chrom")
print(df.execution_plan())
# CooperativeExec
#   BamExec: projection=[name, chrom]  <-- only 2 of 11 columns parsed
```
Tip
COUNT(*) queries also benefit – when no columns are needed, the empty projection path avoids parsing any fields while still counting records correctly.
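For example, reusing the `variants` table registered above:

```python
# Exercises the empty-projection fast path: no fields are parsed
pb.sql("SELECT COUNT(*) FROM variants").collect()
```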
Building & Development
Building from source
polars-bio can be built from source using maturin and poetry:
```bash
git clone https://github.com/biodatageeks/polars-bio.git
cd polars-bio
poetry env use 3.12
poetry update
RUSTFLAGS="-Ctarget-cpu=native" maturin build --release -m Cargo.toml
pip install target/wheels/polars_bio-*.whl
```
For development (installs directly into the current virtual environment):
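A typical maturin development install would be the following (command assumed, mirroring the release build flags above):

```bash
RUSTFLAGS="-Ctarget-cpu=native" maturin develop --release -m Cargo.toml
```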
Required dependencies
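Inferred from the commands in this section (an assumption, not an exhaustive list):

- Rust stable toolchain (`cargo`, used by maturin and `cargo check`)
- Python 3.12 with `poetry`
- `maturin` for building the PyO3 extension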
Running tests
Unit and integration tests use pytest:
```bash
# Run all tests
python -m pytest tests/ -v

# Run specific test modules
python -m pytest tests/test_io_bam.py -v
python -m pytest tests/test_io_vcf.py -v
```
For cloud storage integration tests (requires azure-cli and docker):
```bash
cd it
source bin/start.sh
JUPYTER_PLATFORM_DIRS=1 pytest it_object_storage_io.py -o log_cli=true --log-cli-level=INFO
source bin/stop.sh
```
Check the README in the it directory for more information.
Building documentation
The documentation is built with MkDocs and the Material theme:
```bash
# Full build with all dynamic content
MKDOCS_EXPORTER_PDF=false JUPYTER_PLATFORM_DIRS=1 mkdocs serve -w polars_bio
```
To speed up development builds by disabling dynamic content rendering:
```bash
MKDOCS_EXPORTER_PDF=false ENABLE_MD_EXEC=false ENABLE_MKDOCSTRINGS=false ENABLE_JUPYTER=false JUPYTER_PLATFORM_DIRS=1 mkdocs serve
```
Rust compilation check
For a quick compilation check without building the full wheel:
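```bash
# Type-checks and borrow-checks the Rust crate without producing a wheel
cargo check
```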