Features
Genomic ranges operations
| Features | Bioframe | polars-bio | PyRanges | Pybedtools | PyGenomics | GenomicRanges |
|---|---|---|---|---|---|---|
| overlap | ||||||
| nearest | ||||||
| count_overlaps | ||||||
| cluster | ||||||
| merge | ||||||
| complement | ||||||
| coverage | ||||||
| expand | ||||||
| sort | ||||||
| read_table | ||||||
Limitations
For now polars-bio uses int32 position encoding for interval operations (issue), meaning that it does not support operations on chromosomes longer than 2 Gb. int64 support is planned for future releases (issue).
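The practical consequence is that every coordinate must fit in a signed 32-bit integer. A quick plain-Python check (illustrative only, not part of the polars-bio API):

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, i.e. roughly the 2 Gb limit mentioned above

def fits_int32(position: int) -> bool:
    """Return True if a genomic position can be encoded as int32."""
    return -2**31 <= position <= INT32_MAX

print(fits_int32(248_956_422))    # length of human chr1: True
print(fits_int32(3_000_000_000))  # a hypothetical 3 Gb coordinate: False
```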
Memory Optimization
For overlap() operations that produce very large result sets, use the low_memory=True parameter to reduce peak memory consumption:
This enables streaming output generation that caps the output batch size, trading some performance for significantly lower memory usage.
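A minimal sketch of the option in use (file names are illustrative):

```python
import polars_bio as pb

# Scanned inputs carry coordinate-system metadata automatically
df1 = pb.scan_vcf("a.vcf")
df2 = pb.scan_vcf("b.vcf")

# low_memory=True caps the output batch size during streaming generation,
# reducing peak memory at some cost in throughput
result = pb.overlap(df1, df2, low_memory=True)
```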
Coordinate systems support
polars-bio supports both 0-based half-open and 1-based closed coordinate systems for genomic ranges operations. By default, it uses 1-based closed coordinates, which is the native format for VCF, GFF, and SAM/BAM files.
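The two conventions describe the same physical interval and differ only in how the start is encoded. A plain-Python illustration of the conversion (not part of the polars-bio API):

```python
def one_based_to_zero_based(start: int, end: int) -> tuple[int, int]:
    """1-based closed [start, end] -> 0-based half-open [start-1, end)."""
    return start - 1, end

def zero_based_to_one_based(start: int, end: int) -> tuple[int, int]:
    """0-based half-open [start, end) -> 1-based closed [start+1, end]."""
    return start + 1, end

# The first 100 bases of a chromosome:
assert one_based_to_zero_based(1, 100) == (0, 100)   # VCF/GFF-style -> BED-style
assert zero_based_to_one_based(0, 100) == (1, 100)   # BED-style -> VCF/GFF-style
```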
How it works
The coordinate system is managed through DataFrame metadata that is set at I/O time and read by range operations. This ensures consistency throughout your analysis pipeline.
```mermaid
flowchart TB
    subgraph IO["I/O Layer"]
        scan["scan_vcf/gff/bam/cram/bed()"]
        read["read_vcf/gff/bam/cram/bed()"]
    end
    subgraph Config["Session Configuration"]
        zero_based["datafusion.bio.coordinate_system_zero_based<br/>(default: false = 1-based)"]
        check["datafusion.bio.coordinate_system_check<br/>(default: false = lenient)"]
    end
    subgraph DF["DataFrame with Metadata"]
        polars_meta["Polars DataFrame/LazyFrame<br/>coordinate_system_zero_based"]
        pandas_meta["Pandas DataFrame<br/>df.attrs"]
    end
    subgraph RangeOps["Range Operations"]
        overlap["overlap()"]
        nearest["nearest()"]
        count["count_overlaps()"]
        coverage["coverage()"]
        merge["merge()"]
    end
    subgraph Validation["Metadata Validation"]
        validate["validate_coordinate_systems()"]
        error1["MissingCoordinateSystemError"]
        error2["CoordinateSystemMismatchError"]
        fallback["Fallback to global config<br/>+ emit warning"]
    end
    scan --> |"sets metadata"| polars_meta
    read --> |"sets metadata"| polars_meta
    zero_based --> |"use_zero_based param<br/>or default"| scan
    zero_based --> |"use_zero_based param<br/>or default"| read
    polars_meta --> overlap
    polars_meta --> nearest
    polars_meta --> count
    polars_meta --> coverage
    polars_meta --> merge
    pandas_meta --> overlap
    overlap --> validate
    nearest --> validate
    count --> validate
    coverage --> validate
    merge --> validate
    validate --> |"metadata missing"| check
    validate --> |"metadata mismatch"| error2
    check --> |"true (strict)"| error1
    check --> |"false (lenient)"| fallback
    fallback --> zero_based
```
Session parameters
polars-bio provides two session parameters to control coordinate system behavior:
| Parameter | Default | Description |
|---|---|---|
| `datafusion.bio.coordinate_system_zero_based` | `"false"` (1-based) | Default coordinate system for I/O operations when `use_zero_based` is not specified |
| `datafusion.bio.coordinate_system_check` | `"false"` (lenient) | Whether to raise an error when DataFrame metadata is missing |
import polars_bio as pb
# Check current settings
print(pb.get_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED)) # "false"
print(pb.get_option(pb.POLARS_BIO_COORDINATE_SYSTEM_CHECK)) # "false"
# Change to 0-based coordinates globally
pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)
Reading files with coordinate system metadata
When you read genomic files using polars-bio I/O functions, the coordinate system metadata is automatically set on the returned DataFrame:
import polars_bio as pb
# Default: 1-based coordinates (use_zero_based=False)
df = pb.scan_vcf("variants.vcf")
# Metadata is automatically set: coordinate_system_zero_based=False
# Explicit 0-based coordinates
df_zero = pb.scan_bed("regions.bed", use_zero_based=True)
# Metadata is automatically set: coordinate_system_zero_based=True
# Range operations read coordinate system from metadata
result = pb.overlap(df, df_zero, ...) # Raises CoordinateSystemMismatchError!
Setting metadata on DataFrames
For DataFrames not created via polars-bio I/O functions, you must set the coordinate system metadata manually:
import polars as pl
# Create a DataFrame
df = pl.DataFrame({
"chrom": ["chr1", "chr1"],
"start": [100, 200],
"end": [150, 250]
}).lazy()
# Set coordinate system metadata (requires polars-config-meta)
df = df.config_meta.set(coordinate_system_zero_based=False) # 1-based
# Now it can be used with range operations
result = pb.overlap(df, other_df, ...)
import pandas as pd
# Create a DataFrame
pdf = pd.DataFrame({
"chrom": ["chr1", "chr1"],
"start": [100, 200],
"end": [150, 250]
})
# Set coordinate system metadata via df.attrs
pdf.attrs["coordinate_system_zero_based"] = False # 1-based
# Now it can be used with range operations
result = pb.overlap(pdf, other_df, output_type="pandas.DataFrame", ...)
Error handling
polars-bio raises specific errors to prevent coordinate system mismatches:
MissingCoordinateSystemError
Raised when a DataFrame lacks coordinate system metadata:
import polars as pl
import polars_bio as pb
# DataFrame without metadata
df = pl.DataFrame({"chrom": ["chr1"], "start": [100], "end": [200]}).lazy()
# This raises MissingCoordinateSystemError
pb.overlap(df, other_df, ...)
How to fix: Set metadata on your DataFrame before passing it to range operations (see examples above).
CoordinateSystemMismatchError
Raised when two DataFrames have different coordinate systems:
import polars_bio as pb
# One DataFrame is 1-based, another is 0-based
df1 = pb.scan_vcf("file.vcf") # 1-based (default)
df2 = pb.scan_bed("file.bed", use_zero_based=True) # 0-based
# This raises CoordinateSystemMismatchError
pb.overlap(df1, df2, ...)
How to fix: Ensure both DataFrames use the same coordinate system.
Default behavior (lenient validation)
By default, polars-bio uses lenient validation (coordinate_system_check=false). When a DataFrame lacks coordinate system metadata, it falls back to the global configuration and emits a warning:
import polars as pl
import polars_bio as pb
# DataFrames without metadata will use the global config with a warning
df = pl.DataFrame({"chrom": ["chr1"], "start": [100], "end": [200]}).lazy()
result = pb.overlap(df, other_df, ...) # Uses global coordinate system setting
# Warning: Coordinate system metadata is missing. Using global config...
Strict mode
For production pipelines where coordinate system consistency is critical, you can enable strict validation:
import polars_bio as pb
# Enable strict coordinate system check
pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_CHECK, True)
# Now DataFrames without metadata will raise MissingCoordinateSystemError
Tip
Enable strict mode in production pipelines to catch coordinate system mismatches early and prevent incorrect results.
Migration from previous versions
If you're upgrading from a previous version of polars-bio:
- Range operations no longer accept the `use_zero_based` parameter; the coordinate system is read from DataFrame metadata
- I/O functions use the `use_zero_based` parameter (renamed from `one_based` with inverted logic)
- Pandas DataFrames require explicit metadata: set `df.attrs["coordinate_system_zero_based"]` before range operations
# Before (old API)
result = pb.overlap(df1, df2, use_zero_based=True, ...)
# After (new API) - set metadata at I/O time or on DataFrames
df1 = pb.scan_vcf("file.vcf", use_zero_based=True)
df2 = pb.scan_bed("file.bed", use_zero_based=True)
result = pb.overlap(df1, df2, ...) # Reads from metadata
File Metadata
polars-bio automatically attaches comprehensive metadata to DataFrames when reading genomic files. This metadata includes format information, coordinate systems, and format-specific details like VCF header fields.
Metadata Structure
The metadata is stored in a clean, user-friendly structure:
import polars_bio as pb
lf = pb.scan_vcf("variants.vcf")
meta = pb.get_metadata(lf)
# Returns:
{
"format": "vcf", # File format
"path": "variants.vcf", # Source file path
"coordinate_system_zero_based": False, # Coordinate system (VCF is 1-based)
"header": {
"version": "VCFv4.2", # VCF version
"sample_names": ["Sample1", "Sample2"], # Sample names
"info_fields": { # INFO field definitions
"AF": {
"number": "A",
"type": "Float",
"description": "Allele Frequency",
"id": "AF"
}
},
"format_fields": { # FORMAT field definitions
"GT": {
"number": "1",
"type": "String",
"description": "Genotype"
}
},
"contigs": [...], # Contig definitions
"filters": [...], # Filter definitions
"_datafusion_table_name": "variants" # Internal table name (for debugging)
}
}
Accessing Metadata
polars-bio provides three main functions for working with metadata:
1. Get all metadata as a dictionary
import polars_bio as pb
lf = pb.scan_vcf("file.vcf")
meta = pb.get_metadata(lf)
# Access different parts
print(meta["format"]) # "vcf"
print(meta["path"]) # "file.vcf"
print(meta["coordinate_system_zero_based"]) # False (1-based)
# Access VCF-specific fields
print(meta["header"]["version"]) # "VCFv4.2"
print(meta["header"]["sample_names"]) # ["Sample1", "Sample2"]
# Access INFO field definitions
af_field = meta["header"]["info_fields"]["AF"]
print(af_field["type"]) # "Float"
print(af_field["description"]) # "Allele Frequency"
# Access FORMAT field definitions
gt_field = meta["header"]["format_fields"]["GT"]
print(gt_field["type"]) # "String"
2. Print metadata as formatted JSON
import polars_bio as pb
lf = pb.scan_vcf("file.vcf")
# Print as pretty JSON
pb.print_metadata_json(lf)
# Customize indentation
pb.print_metadata_json(lf, indent=4)
3. Print human-readable summary
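Call `print_metadata_summary()` on the scanned frame:

```python
import polars_bio as pb

lf = pb.scan_vcf("file.vcf")

# Print a human-readable summary of the attached metadata
pb.print_metadata_summary(lf)
```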
Output:
======================================================================
Metadata Summary
======================================================================
Format: vcf
Path: file.vcf
Coordinate System: 1-based
Format-specific metadata:
----------------------------------------------------------------------
VCF Version: VCFv4.2
Samples (3): Sample1, Sample2, Sample3
INFO fields: 5
- AF: Float (Allele Frequency)
- DP: Integer (Total Depth)
- AC: Integer (Allele Count)
FORMAT fields: 3
- GT: String (Genotype)
- DP: Integer (Read Depth)
- GQ: Integer (Genotype Quality)
======================================================================
Format-Specific Metadata
Different file formats include different metadata:
lf = pb.scan_vcf("variants.vcf")
meta = pb.get_metadata(lf)
# VCF header metadata
meta["header"]["version"] # VCF version
meta["header"]["sample_names"] # Sample names
meta["header"]["info_fields"] # INFO field definitions
meta["header"]["format_fields"] # FORMAT field definitions
meta["header"]["contigs"] # Contig definitions
meta["header"]["filters"] # Filter definitions
Setting Custom Metadata
You can set metadata on DataFrames created from other sources:
import polars as pl
import polars_bio as pb
# Create a DataFrame
df = pl.DataFrame({
"chrom": ["chr1", "chr1"],
"start": [100, 200],
"end": [150, 250]
}).lazy()
# Set metadata
pb.set_source_metadata(
df,
format="bed",
path="custom.bed",
header={"description": "Custom intervals"}
)
# Now metadata is available
meta = pb.get_metadata(df)
print(meta["format"]) # "bed"
print(meta["header"]["description"]) # "Custom intervals"
Metadata Preservation
Metadata is preserved through Polars operations:
lf = pb.scan_vcf("variants.vcf")
# Metadata persists after operations
filtered = lf.filter(pl.col("qual") > 30)
selected = lf.select(["chrom", "start", "end"])
limited = lf.head(100)
# All have the same metadata
meta1 = pb.get_metadata(lf)
meta2 = pb.get_metadata(filtered)
meta3 = pb.get_metadata(selected)
assert meta1["format"] == meta2["format"] == meta3["format"] # All "vcf"
Using Metadata for Debugging
The _datafusion_table_name field is useful for debugging DataFusion SQL queries:
lf = pb.scan_vcf("variants.vcf")
meta = pb.get_metadata(lf)
# Get internal table name
table_name = meta["header"]["_datafusion_table_name"]
print(f"Table name: {table_name}") # "variants"
# Use it in SQL queries for debugging
result = pb.sql(f"SELECT COUNT(*) FROM {table_name}")
API Reference
| Function | Description |
|---|---|
| `get_metadata(df)` | Get all metadata as a dictionary |
| `print_metadata_json(df, indent=2)` | Print metadata as formatted JSON |
| `print_metadata_summary(df)` | Print human-readable metadata summary |
| `set_source_metadata(df, format, path, header)` | Set metadata on a DataFrame |
API comparison between libraries
There is no standard API for genomic ranges operations in Python. The table below compares the APIs of the libraries; it is not exhaustive and only shows the most common operations used in benchmarking.
| operation | Bioframe | polars-bio | PyRanges (v0) | PyRanges (v1) | Pybedtools | GenomicRanges |
|---|---|---|---|---|---|---|
| overlap | overlap | overlap | join¹ | join_ranges | intersect² | find_overlaps³ |
| nearest | closest | nearest | nearest | nearest | closest⁴ | nearest⁵ |
| read_table | read_table | read_table | read_bed | read_bed | BedTool | read_bed |
Note

1. There is an `overlap` method in PyRanges, but its output is limited to the indices of intervals from the other DataFrame that overlap. In Bioframe's benchmark the `join` method was likewise used instead of `overlap`.
2. The `wa` and `wb` options were used to obtain a comparable output.
3. The output contains only a list of the same length as the query, holding hits to overlapping indices, so a data transformation is required to obtain the same output as in the other libraries. Since performance was already far worse than in the more efficient libraries, this additional transformation was not included in the benchmark.
4. `s=first` was used to obtain a comparable output.
5. `select="arbitrary"` was used to obtain a comparable output.
File formats support
For each bioinformatic format, three methods are always available: `read_*` (eager), `scan_*` (lazy), and `register_*`, which can be used either to read a file into a Polars DataFrame/LazyFrame or to register it as a DataFusion table for further processing using SQL or the built-in interval methods. In either case, local or cloud storage files can be used as input. Please refer to the cloud storage section for more details.
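For example, the three access patterns for VCF look like this (the file path is illustrative):

```python
import polars_bio as pb

df = pb.read_vcf("variants.vcf")             # eager: Polars DataFrame
lf = pb.scan_vcf("variants.vcf")             # lazy: Polars LazyFrame
pb.register_vcf("variants.vcf", "variants")  # DataFusion table for SQL

result = pb.sql("SELECT * FROM variants LIMIT 5").collect()
```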
| Format | Single-threaded | Parallel (indexed) | Limit pushdown | Predicate pushdown | Projection pushdown |
|---|---|---|---|---|---|
| BED | β | β | β | ||
| VCF | |||||
| BAM | |||||
| CRAM | |||||
| FASTQ | β | β | |||
| FASTA | β | β | β | ||
| GFF3 | |||||
| Pairs |||||
Indexed reads & predicate pushdown
When an index file is present alongside the data file (BAI/CSI for BAM, CRAI for CRAM, TBI/CSI for VCF, GFF, and Pairs), polars-bio can push genomic region filters down to the DataFusion execution layer. This enables index-based random access: only the relevant genomic regions are read from disk, dramatically improving performance for selective queries on large files.
Index files are auto-discovered by convention. Predicate pushdown is enabled by default for BAM, CRAM, VCF, GFF, and Pairs formats; no extra configuration is needed.
Supported index formats
| Data Format | Index Formats | Naming Convention |
|---|---|---|
| BAM | BAI, CSI | sample.bam.bai or sample.bai, sample.bam.csi |
| CRAM | CRAI | sample.cram.crai |
| VCF (bgzf) | TBI, CSI | sample.vcf.gz.tbi, sample.vcf.gz.csi |
| GFF (bgzf) | TBI, CSI | sample.gff.gz.tbi, sample.gff.gz.csi |
| Pairs (bgzf) | TBI, CSI | contacts.pairs.gz.tbi, contacts.pairs.gz.csi |
| FASTQ (bgzf) | GZI | sample.fastq.bgz.gzi |
Usage with the scan/read API
Simply use `.filter()`: predicate pushdown is enabled by default for BAM, CRAM, VCF, GFF, and Pairs:
import polars as pl
import polars_bio as pb
# Single chromosome filter: only chr1 data is read from disk
df = (
pb.scan_bam("alignments.bam")
.filter(pl.col("chrom") == "chr1")
.collect()
)
# Multi-chromosome filter
df = (
pb.scan_vcf("variants.vcf.gz")
.filter(pl.col("chrom").is_in(["chr21", "chr22"]))
.collect()
)
# Region query: combines chromosome and coordinate filters
df = (
pb.scan_bam("alignments.bam")
.filter(
(pl.col("chrom") == "chr1")
& (pl.col("start") >= 10000)
& (pl.col("end") <= 50000)
)
.collect()
)
# CRAM with predicate pushdown
df = (
pb.scan_cram("alignments.cram")
.filter(pl.col("chrom") == "chr1")
.collect()
)
Tip
Predicate pushdown supports: equality (==), comparisons (>=, <=, >, <), is_in(), is_null(), is_not_null(), and combinations with & (AND). Complex predicates like .str.contains() or OR logic are automatically filtered client-side. To disable pushdown, pass predicate_pushdown=False.
Usage with the SQL API
The SQL path works automatically: DataFusion parses the WHERE clause and uses the index without any extra flags:
import polars_bio as pb
pb.register_bam("alignments.bam", "reads")
# Single chromosome
result = pb.sql("SELECT * FROM reads WHERE chrom = 'chr1'").collect()
# Region query
result = pb.sql(
"SELECT * FROM reads WHERE chrom = 'chr1' AND start >= 10000 AND \"end\" <= 50000"
).collect()
# Combined genomic and record filters
result = pb.sql(
"SELECT * FROM reads WHERE chrom = 'chr1' AND mapping_quality >= 30"
).collect()
Automatic parallel partitioning
When an index file is present, DataFusion distributes genomic regions across balanced partitions using index-derived size estimates, enabling parallel execution. Formats with known contig lengths (BAM, CRAM) can split large regions into sub-regions for full parallelism even on single-chromosome queries. For FASTQ files, a GZI index alongside a BGZF-compressed file enables parallel decoding of compressed blocks. This is controlled by the global `target_partitions` setting:
import polars_bio as pb
pb.set_option("datafusion.execution.target_partitions", "8")
df = pb.read_bam("large_file.bam") # 8 partitions will be used for parallel execution
df = pb.read_fastq("reads.fastq.bgz") # parallel BGZF decoding when .gzi index is present
Partitioning behavior (BAM, CRAM, VCF, GFF):
| Index Available? | SQL Filters | Partitions |
|---|---|---|
| Yes | `chrom = 'chr1' AND start >= 1000` | up to `target_partitions` (region split into sub-regions) |
| Yes | `chrom IN ('chr1', 'chr2')` | up to `target_partitions` (both regions split to fill bins) |
| Yes | `mapping_quality >= 30` (no genomic filter) | up to `target_partitions` (all chroms balanced + split) |
| Yes | None (full scan) | up to `target_partitions` (all chroms balanced + split) |
| No | Any | 1 (sequential full scan) |
Partitioning behavior (FASTQ):
| File type | GZI Index? | Partitions |
|---|---|---|
| BGZF (`.fastq.bgz`) | Yes (`.fastq.bgz.gzi`) | up to `target_partitions` (parallel block decoding) |
| BGZF (`.fastq.bgz`) | No | 1 (sequential read) |
| GZIP (`.fastq.gz`) | N/A | 1 (sequential; GZIP cannot be parallelized) |
| Uncompressed (`.fastq`) | N/A | up to `target_partitions` (byte-range parallel) |
Record-level filter pushdown
Beyond index-based region queries, all formats support record-level predicate evaluation. Filters on columns like mapping_quality, flag, score, or strand are evaluated as each record is read, filtering early before Arrow RecordBatch construction.
This works with or without an index file:
import polars as pl
import polars_bio as pb
# No index needed: filters applied per-record during scan
df = (
pb.scan_bam("alignments.bam")
.filter((pl.col("mapping_quality") >= 30) & (pl.col("flag") & 4 == 0))
.collect()
)
# Combine genomic region (uses index) with record filter (applied per-record)
df = (
pb.scan_bam("alignments.bam")
.filter(
(pl.col("chrom") == "chr1")
& (pl.col("start") >= 1000000)
& (pl.col("mapping_quality") >= 30)
)
.collect()
)
Projection pushdown
BAM, CRAM, VCF, and Pairs formats support parsing-level projection pushdown. When you select a subset of columns, unprojected fields are skipped entirely during record parsing: no string formatting, sequence decoding, map lookups, or memory allocation for those fields. This can significantly reduce I/O and CPU time, especially for wide schemas like BAM (11+ columns) where you only need a few fields.
Projection pushdown is enabled by default (projection_pushdown=True) on all scan_*/read_* calls and range operations. To disable it, pass projection_pushdown=False.
import polars_bio as pb
# Only name and chrom are parsed from each BAM record (projection pushdown is on by default)
df = (
pb.scan_bam("alignments.bam")
.select(["name", "chrom"])
.collect()
)
# Works the same for CRAM and VCF
df = pb.scan_cram("alignments.cram").select(["name", "chrom"]).collect()
# Works with SQL too: only referenced columns are parsed
pb.register_vcf("variants.vcf.gz", "variants")
result = pb.sql("SELECT chrom, start FROM variants").collect()
You can verify pushdown is active by inspecting the physical execution plan:
from polars_bio.context import ctx
from polars_bio.polars_bio import (
InputFormat, ReadOptions, BamReadOptions,
py_register_table, py_read_table,
)
read_options = ReadOptions(bam_read_options=BamReadOptions())
table = py_register_table(ctx, "alignments.bam", None, InputFormat.Bam, read_options)
df = py_read_table(ctx, table.name).select_columns("name", "chrom")
print(df.execution_plan())
# CooperativeExec
# BamExec: projection=[name, chrom] <-- only 2 of 11 columns parsed
Tip
COUNT(*) queries also benefit: when no columns are needed, the empty projection path avoids parsing any fields while still counting records correctly.
Index file generation
Creating index files
Create index files using standard bioinformatics tools:
# BAM: sort and index
samtools sort input.bam -o sorted.bam
samtools index sorted.bam # creates sorted.bam.bai
# CRAM: sort and index
samtools sort input.cram -o sorted.cram --reference ref.fa
samtools index sorted.cram # creates sorted.cram.crai
# VCF: sort, compress, and index
bcftools sort input.vcf -Oz -o sorted.vcf.gz
bcftools index -t sorted.vcf.gz # creates sorted.vcf.gz.tbi
# GFF: sort, compress, and index
(grep "^#" input.gff; grep -v "^#" input.gff | sort -k1,1 -k4,4n) | bgzip > sorted.gff.gz
tabix -p gff sorted.gff.gz # creates sorted.gff.gz.tbi
# Pairs: sort, compress, and index (col 2=chr1, col 3=pos1)
sort -k2,2 -k3,3n contacts.pairs | bgzip > contacts.pairs.gz
tabix -s 2 -b 3 -e 3 contacts.pairs.gz # creates contacts.pairs.gz.tbi
# FASTQ: BGZF compress and create GZI index for parallel reads
bgzip reads.fastq # creates reads.fastq.bgz
bgzip -r reads.fastq.bgz # creates reads.fastq.bgz.gzi
File Output
polars-bio supports writing DataFrames back to bioinformatic file formats. Two methods are available for each supported format:
- `write_*` - Eager write that collects the DataFrame and writes it to disk; returns the row count
- `sink_*` - Streaming write for LazyFrames that processes data in batches without full materialization
Output Format Support
| Format | write_* | sink_* | Compression | Notes |
|---|---|---|---|---|
| VCF | ✓ | ✓ | `.vcf.gz`, `.vcf.bgz` | Auto-detected from extension |
| BAM | ✓ | ✓ | BGZF (built-in) | Binary alignment format |
| SAM | ✓ | ✓ | None | Plain text alignment format |
| CRAM | ✓ | ✓ | Built-in | Requires reference FASTA |
| FASTQ | ✓ | ✓ | `.fastq.gz`, `.fastq.bgz` | Auto-detected from extension |
Basic Usage
import polars_bio as pb
# Read, transform, and write back
df = pb.read_bam("input.bam", tag_fields=["NM", "AS"])
filtered = df.filter(pl.col("mapping_quality") > 20)
pb.write_bam(filtered, "output.bam")
# Streaming write with LazyFrame
lf = pb.scan_vcf("variants.vcf")
pb.sink_vcf(lf.filter(pl.col("qual") > 30), "filtered.vcf.bgz")
Sorted Output with sort_on_write
BAM, SAM, and CRAM write functions support the sort_on_write parameter to produce coordinate-sorted output:
import polars_bio as pb
# Write coordinate-sorted BAM
df = pb.read_bam("unsorted.bam")
pb.write_bam(df, "sorted.bam", sort_on_write=True)
# Streaming sorted write
lf = pb.scan_sam("input.sam")
pb.sink_bam(lf, "sorted.bam", sort_on_write=True)
When `sort_on_write=True`:

- Records are sorted by `(chrom ASC, start ASC)` during write
- Output header contains `@HD ... SO:coordinate`

When `sort_on_write=False` (default):

- Records are written in input order
- Output header contains `@HD ... SO:unsorted`
CRAM Output
CRAM format requires a reference FASTA file for writing:
import polars_bio as pb
# CRAM write requires reference_path
df = pb.read_cram("input.cram", reference_path="reference.fa")
pb.write_cram(df, "output.cram", reference_path="reference.fa")
# Streaming CRAM write
lf = pb.scan_cram("input.cram", reference_path="reference.fa")
pb.sink_cram(lf, "output.cram", reference_path="reference.fa", sort_on_write=True)
Warning
The reference_path parameter is required for write_cram() and sink_cram(). Attempting to write CRAM without a reference will raise an error.
Output Compression
Output compression is auto-detected from the file extension for VCF and FASTQ formats:
| Extension | Compression |
|---|---|
| `.vcf` / `.fastq` | None (plain text) |
| `.vcf.gz` / `.fastq.gz` | GZIP |
| `.vcf.bgz` / `.fastq.bgz` | BGZF (block gzip) |
import polars_bio as pb
# VCF
df = pb.read_vcf("variants.vcf")
pb.write_vcf(df, "output.vcf") # plain text
pb.write_vcf(df, "output.vcf.gz") # GZIP
pb.write_vcf(df, "output.vcf.bgz") # BGZF (recommended for indexing)
# FASTQ
df = pb.read_fastq("reads.fastq")
pb.write_fastq(df, "output.fastq") # plain text
pb.write_fastq(df, "output.fastq.gz") # GZIP
pb.write_fastq(df, "output.fastq.bgz") # BGZF (recommended for parallel reads with GZI index)
# Streaming write
lf = pb.scan_fastq("large_reads.fastq.gz")
pb.sink_fastq(lf.limit(1000), "sample.fastq")
Header Preservation
When reading and writing alignment files (BAM/SAM/CRAM), polars-bio preserves header metadata including:
@SQ(sequence dictionary)@RG(read groups)@PG(program records)
This enables lossless roundtrip workflows:
import polars_bio as pb
# Read with full header preservation
df = pb.read_bam("input.bam")
# Filter records
filtered = df.filter(pl.col("mapping_quality") > 20)
# Write back - header metadata is preserved
pb.write_bam(filtered, "filtered.bam")
Polars Extension Methods
Write functions are also available as Polars namespace extensions:
import polars_bio as pb
# DataFrame extensions
df = pb.read_bam("input.bam")
df.pb.write_bam("output.bam", sort_on_write=True)
df.pb.write_sam("output.sam")
df.pb.write_cram("output.cram", reference_path="ref.fa")
df.pb.write_vcf("output.vcf.bgz")
df.pb.write_fastq("output.fastq.gz")
# LazyFrame extensions
lf = pb.scan_bam("input.bam")
lf.pb.sink_bam("output.bam", sort_on_write=True)
lf.pb.sink_sam("output.sam")
lf.pb.sink_cram("output.cram", reference_path="ref.fa")
lf.pb.sink_vcf("output.vcf.bgz")
lf.pb.sink_fastq("output.fastq.bgz")
SQL-powered data processing
polars-bio provides a SQL-like API for bioinformatic data querying or manipulation. Check SQL reference for more details.
import polars_bio as pb
pb.register_vcf("gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_sv", info_fields=["SVTYPE", "SVLEN"])
pb.sql("SELECT * FROM gnomad_sv WHERE SVTYPE = 'DEL' AND SVLEN > 1000").limit(3).collect()
Accessing registered tables
You can access registered tables programmatically using the ctx.table() method, which returns a DataFusion DataFrame:
import polars_bio as pb
from polars_bio.context import ctx
# Register a file as a table
pb.register_vcf("variants.vcf", name="my_variants")
# Get the table as a DataFusion DataFrame
df = ctx.table("my_variants")
# Access the Arrow schema (includes coordinate system metadata)
schema = df.schema()
print(schema.metadata) # {b'bio.coordinate_system_zero_based': b'false'}
# Execute queries on the DataFrame
result = df.filter(df["chrom"] == "chr1").collect()
Tip
The ctx.table() method is useful for:
- Accessing Arrow schema metadata (including coordinate system information)
- Using the DataFusion DataFrame API directly
- Integrating with other DataFusion-based tools
Schema Inspection
Quickly inspect BAM/CRAM file schemas without reading the entire file:
import polars_bio as pb
# Get schema information for BAM file
schema = pb.describe_bam("file.bam")
print(schema)
# shape: (11, 2)
# ┌────────┬──────────┐
# │ column ┆ datatype │
# │ ---    ┆ ---      │
# │ str    ┆ str      │
# ╞════════╪══════════╡
# │ name   ┆ String   │
# │ chrom  ┆ String   │
# │ start  ┆ UInt32   │
# ...
# Include tag columns in schema
schema = pb.describe_bam("file.bam", tag_fields=["NM", "AS", "MD"])
print(schema) # Shows 14 columns including tags
# CRAM schema
schema = pb.describe_cram("file.cram")
BAM Optional Tags
polars-bio supports reading BAM optional alignment tags as individual columns. Tags are only parsed when explicitly requested, ensuring zero overhead for standard reads.
Note: CRAM tag support is planned for a future release. The `tag_fields` parameter is accepted for CRAM functions but currently ignored with a warning.
Usage
import polars_bio as pb
# Read BAM with specific tags
df = pb.read_bam(
"alignments.bam",
tag_fields=["NM", "AS", "MD"] # Edit distance, alignment score, mismatch string
)
# Tags appear as regular columns
print(df.select(["name", "chrom", "NM", "AS"]))
# Lazy scan with tag filtering
lf = pb.scan_bam("alignments.bam", tag_fields=["NM", "AS"])
high_quality = lf.filter((pl.col("NM") <= 2) & (pl.col("AS") >= 100)).collect()
# SQL queries (tags must be quoted)
pb.register_bam("alignments.bam", "reads", tag_fields=["NM", "RG"])
result = pb.sql('SELECT name, "NM" FROM reads WHERE "NM" <= 2').collect()
Common Tags
- NM (Int32): Edit distance to reference
- MD (Utf8): Mismatch positions string
- AS (Int32): Alignment score
- XS (Int32): Secondary alignment score
- RG (Utf8): Read group identifier
- CB (Utf8): Cell barcode (single-cell)
- UB (Utf8): UMI barcode (single-cell)
Full registry includes ~40 common SAM tags.
Performance
- Zero overhead when `tag_fields=None` (default)
- Projection pushdown: only selected tags are parsed
- Tags parsed once per batch, not per record
Sample output of the gnomAD structural-variant query:

shape: (3, 10)
┌───────┬───────┬───────┬────────────────────────────────┬───┬───────┬────────────┬────────┬───────┐
│ chrom ┆ start ┆ end   ┆ id                             ┆ … ┆ qual  ┆ filter     ┆ svtype ┆ svlen │
│ ---   ┆ ---   ┆ ---   ┆ ---                            ┆   ┆ ---   ┆ ---        ┆ ---    ┆ ---   │
│ str   ┆ u32   ┆ u32   ┆ str                            ┆   ┆ f64   ┆ str        ┆ str    ┆ i32   │
╞═══════╪═══════╪═══════╪════════════════════════════════╪═══╪═══════╪════════════╪════════╪═══════╡
│ chr1  ┆ 22000 ┆ 30000 ┆ gnomAD-SV_v3_DEL_chr1_fa103016 ┆ … ┆ 999.0 ┆ HIGH_NCR   ┆ DEL    ┆ 8000  │
│ chr1  ┆ 40000 ┆ 47000 ┆ gnomAD-SV_v3_DEL_chr1_b26f63f7 ┆ … ┆ 145.0 ┆ PASS       ┆ DEL    ┆ 7000  │
│ chr1  ┆ 79086 ┆ 88118 ┆ gnomAD-SV_v3_DEL_chr1_733c4ef0 ┆ … ┆ 344.0 ┆ UNRESOLVED ┆ DEL    ┆ 9032  │
└───────┴───────┴───────┴────────────────────────────────┴───┴───────┴────────────┴────────┴───────┘
You can use the view mechanism to create a virtual table from a DataFrame that contains preprocessing steps and reuse it in multiple steps. To avoid materializing the intermediate results in memory, you can run your processing in streaming mode.
Parallel engine
It is straightforward to parallelize operations in polars-bio. The library is built on top of Apache DataFusion, so you can set the degree of parallelism using the `datafusion.execution.target_partitions` option, e.g.:
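Setting it via `pb.set_option`:

```python
import polars_bio as pb

# Global session setting: use 8 partitions for parallel execution
pb.set_option("datafusion.execution.target_partitions", "8")
```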
Tip
- The default value is 1 (parallel execution disabled).
- The `datafusion.execution.target_partitions` option is a global setting and affects all operations in the current session.
- Check available strategies for optimal performance.
- See the other configuration settings in the Apache DataFusion documentation.
Cloud storage
polars-bio supports direct streamed reading from cloud storage (e.g. S3, GCS), enabling processing of large-scale genomics data without materializing it in memory. It is built upon the OpenDAL project, a unified data access layer for cloud storage, which allows reading bioinformatic file formats from various cloud storage providers. For Apache DataFusion native file formats, such as Parquet or CSV, please refer to the DataFusion user guide.
Example
import polars_bio as pb
## Register VCF files from Google Cloud Storage that will be streamed - no need to download them to the local disk, size ~0.8TB
pb.register_vcf("gs://gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/genomes/gnomad.genomes.r2.1.1.sites.liftover_grch38.vcf.bgz", "gnomad_big", allow_anonymous=True)
pb.register_vcf("gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_sv", allow_anonymous=True)
pb.overlap("gnomad_sv", "gnomad_big", streaming=True).sink_parquet("/tmp/overlap.parquet")
Tip
If you access cloud storage with authentication, please make sure the allow_anonymous parameter is set to False in the read/describe/register_table functions.
Supported features
| Feature | AWS S3 | Google Cloud Storage | Azure Blob Storage |
|---|---|---|---|
| Anonymous access | |||
| Authenticated access | |||
| Requester Pays | |||
| Concurrent requests1 | |||
| Streaming reads |
Note
1 For more information on concurrent requests and block size tuning, please refer to the issue.
AWS S3 configuration
Supported environment variables:
| Variable | Description |
|---|---|
| AWS_ACCESS_KEY_ID | AWS access key ID for authenticated access to S3. |
| AWS_SECRET_ACCESS_KEY | AWS secret access key for authenticated access to S3. |
| AWS_ENDPOINT_URL | Custom S3 endpoint URL for accessing S3-compatible storage. |
| AWS_REGION or AWS_DEFAULT_REGION | AWS region for accessing S3. |
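For example, the variables above can be exported before starting Python to target an S3-compatible object store; every value below is a placeholder:

```shell
# Placeholder credentials and endpoint for an S3-compatible store (e.g. MinIO)
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_ENDPOINT_URL="http://localhost:9000"
export AWS_REGION="us-east-1"
```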
Google Cloud Storage configuration
Supported environment variables:
| Variable | Description |
|---|---|
| GOOGLE_APPLICATION_CREDENTIALS | Path to the Google Cloud service account key file for authenticated access to GCS. |
Azure Blob Storage configuration
Supported environment variables:
| Variable | Description |
|---|---|
| AZURE_STORAGE_ACCOUNT | Azure Storage account name for authenticated access to Azure Blob Storage. |
| AZURE_STORAGE_KEY | Azure Storage account key for authenticated access to Azure Blob Storage. |
| AZURE_ENDPOINT_URL | Azure Blob Storage endpoint URL for accessing Azure Blob Storage. |
Streaming
polars-bio supports out-of-core processing with Apache DataFusion async streams and the Polars LazyFrame streaming option. This can bring a significant speedup as well as a reduction in memory usage, allowing you to process large datasets that do not fit in memory. See our benchmark results. There are two ways of using streaming mode:
- By setting the output_type to datafusion.DataFrame and using the Python DataFrame API, including methods such as count, write_parquet, write_csv or write_json. In this option you completely bypass the Polars streaming engine.

import polars_bio as pb
import polars as pl
pb.overlap("/tmp/gnomad.v4.1.sv.sites.parquet", "/tmp/gnomad.exomes.v4.1.sites.chr1.parquet", output_type="datafusion.DataFrame").write_parquet("/tmp/overlap.parquet")
pl.scan_parquet("/tmp/overlap.parquet").collect().count()

shape: (1, 6)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ chrom_1    ┆ start_1    ┆ end_1      ┆ chrom_2    ┆ start_2    ┆ end_2      │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│ u32        ┆ u32        ┆ u32        ┆ u32        ┆ u32        ┆ u32        │
╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╡
│ 2629727337 ┆ 2629727337 ┆ 2629727337 ┆ 2629727337 ┆ 2629727337 ┆ 2629727337 │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘

Tip
If you only need to write the results as fast as possible into one of the above file formats or quickly get the row count, then in most cases this is the best option.
- Using the new Polars streaming engine:

import os
import polars_bio as pb
os.environ['BENCH_DATA_ROOT'] = "/Users/mwiewior/research/data/databio"
os.environ['POLARS_VERBOSE'] = "1"
cols = ["contig", "pos_start", "pos_end"]
BENCH_DATA_ROOT = os.getenv('BENCH_DATA_ROOT', '/data/bench_data/databio')
df_1 = f"{BENCH_DATA_ROOT}/exons/*.parquet"
df_2 = f"{BENCH_DATA_ROOT}/exons/*.parquet"
pb.overlap(df_1, df_2, cols1=cols, cols2=cols).collect(engine="streaming").limit()

1652814rows [00:00, 20208793.67rows/s]
[MultiScanState]: Readers disconnected
polars-stream: done running graph phase
polars-stream: updating graph state
shape: (5, 6)
┌──────────┬─────────────┬───────────┬──────────┬─────────────┬───────────┐
│ contig_1 ┆ pos_start_1 ┆ pos_end_1 ┆ contig_2 ┆ pos_start_2 ┆ pos_end_2 │
│ ---      ┆ ---         ┆ ---       ┆ ---      ┆ ---         ┆ ---       │
│ str      ┆ i32         ┆ i32       ┆ str      ┆ i32         ┆ i32       │
╞══════════╪═════════════╪═══════════╪══════════╪═════════════╪═══════════╡
│ chr1     ┆ 11873       ┆ 12227     ┆ chr1     ┆ 11873       ┆ 12227     │
│ chr1     ┆ 12612       ┆ 12721     ┆ chr1     ┆ 12612       ┆ 12721     │
│ chr1     ┆ 13220       ┆ 14409     ┆ chr1     ┆ 13220       ┆ 14409     │
│ chr1     ┆ 13220       ┆ 14409     ┆ chr1     ┆ 14361       ┆ 14829     │
│ chr1     ┆ 14361       ┆ 14829     ┆ chr1     ┆ 13220       ┆ 14409     │
└──────────┴─────────────┴───────────┴──────────┴─────────────┴───────────┘
Parallelism can be controlled using the datafusion.execution.target_partitions option as described in the parallel engine section (compare the rows/s metric in the following examples).
pb.set_option("datafusion.execution.target_partitions", "1")
pb.overlap(df_1, df_2, cols1=cols, cols2=cols).collect(engine="streaming").count()
1652814rows [00:00, 19664163.99rows/s]
shape: (1, 6)
┌──────────┬─────────────┬───────────┬──────────┬─────────────┬───────────┐
│ contig_1 ┆ pos_start_1 ┆ pos_end_1 ┆ contig_2 ┆ pos_start_2 ┆ pos_end_2 │
│ ---      ┆ ---         ┆ ---       ┆ ---      ┆ ---         ┆ ---       │
│ u32      ┆ u32         ┆ u32       ┆ u32      ┆ u32         ┆ u32       │
╞══════════╪═════════════╪═══════════╪══════════╪═════════════╪═══════════╡
│ 1652814  ┆ 1652814     ┆ 1652814   ┆ 1652814  ┆ 1652814     ┆ 1652814   │
└──────────┴─────────────┴───────────┴──────────┴─────────────┴───────────┘
pb.set_option("datafusion.execution.target_partitions", "2")
pb.overlap(df_1, df_2, cols1=cols, cols2=cols).collect(engine="streaming").count()
1652814rows [00:00, 27841987.75rows/s]
shape: (1, 6)
┌──────────┬─────────────┬───────────┬──────────┬─────────────┬───────────┐
│ contig_1 ┆ pos_start_1 ┆ pos_end_1 ┆ contig_2 ┆ pos_start_2 ┆ pos_end_2 │
│ ---      ┆ ---         ┆ ---       ┆ ---      ┆ ---         ┆ ---       │
│ u32      ┆ u32         ┆ u32       ┆ u32      ┆ u32         ┆ u32       │
╞══════════╪═════════════╪═══════════╪══════════╪═════════════╪═══════════╡
│ 1652814  ┆ 1652814     ┆ 1652814   ┆ 1652814  ┆ 1652814     ┆ 1652814   │
└──────────┴─────────────┴───────────┴──────────┴─────────────┴───────────┘
Compression
polars-bio supports GZIP (default file extension *.gz) and Block GZIP (BGZIP, default file extension *.bgz) when reading files from local and cloud storage.
For BGZIP-compressed FASTQ files, parallel decoding of compressed blocks is automatic; see Automatic parallel partitioning and Index file generation for details. Please take a look at the following GitHub discussion.
DataFrames support
| I/O | Bioframe | polars-bio | PyRanges | Pybedtools | PyGenomics | GenomicRanges |
|---|---|---|---|---|---|---|
| Pandas DataFrame | ||||||
| Polars DataFrame | ||||||
| Polars LazyFrame | ||||||
| Native readers |
Polars Integration
polars-bio leverages deep integration with Polars through the Arrow C Data Interface, enabling high-performance zero-copy data exchange between Polars LazyFrames and the Rust-based genomic range operations engine.
Architecture
flowchart LR
subgraph Python["Python Layer"]
LF["Polars LazyFrame"]
DF["Polars DataFrame"]
end
subgraph FFI["Arrow C Data Interface"]
stream["Arrow C Stream<br/>(__arrow_c_stream__)"]
end
subgraph Rust["Rust Layer (polars-bio)"]
reader["ArrowArrayStreamReader"]
datafusion["DataFusion Engine"]
range_ops["Range Operations<br/>(overlap, nearest, etc.)"]
end
LF --> |"ArrowStreamExportable"| stream
DF --> |"to_arrow()"| stream
stream --> |"Zero-copy FFI"| reader
reader --> datafusion
datafusion --> range_ops
How It Works
When you pass a Polars LazyFrame to range operations like overlap() or nearest():
- Stream Export: The LazyFrame exports itself as an Arrow C Stream via collect_batches(lazy=True)._inner.__arrow_c_stream__() (Polars >= 1.37.0)
- Zero-Copy Transfer: The stream pointer is passed directly to Rust - no data copying or Python object conversion
- GIL-Free Execution: Once the stream is exported, all data processing happens in Rust without holding Python's GIL
- Streaming Execution: Data flows through DataFusion's streaming engine, processing batches on-demand
Performance Benefits
| Aspect | Previous Approach | Arrow C Stream |
|---|---|---|
| GIL acquisition | Per batch | Once at export |
| Data conversion | Polars → PyArrow → Arrow | Direct FFI |
| Memory overhead | Python iterator objects | None |
| Batch processing | Python __next__() calls | Native Rust iteration |
Requirements
- Polars >= 1.37.0 (required for ArrowStreamExportable)
Batch Size Configuration
polars-bio automatically synchronizes the batch size between Polars streaming and DataFusion execution. When you set datafusion.execution.batch_size, Polars' collect_batches() will use the same chunk size:
import polars_bio as pb
# Set batch size for both Polars and DataFusion
pb.set_option("datafusion.execution.batch_size", "8192")
# Now LazyFrame streaming uses 8192-row batches
# This ensures consistent memory usage and processing patterns
| Setting | Effect |
|---|---|
| datafusion.execution.batch_size | Controls batch size for both Polars streaming export and DataFusion processing |
| Default | 8192 rows (synchronized between Polars and DataFusion) |
| "65536" | Larger batches for high-throughput scenarios |
Tip
Matching batch sizes between Polars and DataFusion improves cache locality and reduces memory fragmentation when processing large datasets.
Example
import polars as pl
import polars_bio as pb
# Create a LazyFrame from a large file
lf1 = pl.scan_parquet("variants.parquet")
lf2 = pl.scan_parquet("regions.parquet")
# Set coordinate system metadata
lf1 = lf1.config_meta.set(coordinate_system_zero_based=True)
lf2 = lf2.config_meta.set(coordinate_system_zero_based=True)
# Range operation uses Arrow C Stream for efficient data transfer
result = pb.overlap(
lf1, lf2,
cols1=["chrom", "start", "end"],
cols2=["chrom", "start", "end"],
output_type="polars.LazyFrame"
)
# Execute with Polars streaming engine
result.collect(engine="streaming")