Features
Genomic ranges operations
| Features | Bioframe | polars-bio | PyRanges | Pybedtools | PyGenomics | GenomicRanges |
|---|---|---|---|---|---|---|
| overlap | ||||||
| nearest | ||||||
| count_overlaps | ||||||
| cluster | ||||||
| merge | ||||||
| complement | ||||||
| coverage | ||||||
| expand | ||||||
| sort | ||||||
| read_table | ||||||
Limitations
For now polars-bio uses int32 position encoding for interval operations (issue), meaning that it does not support operations on chromosomes longer than 2 Gb. int64 support is planned for future releases (issue).
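The practical consequence is that every coordinate must fit in a signed 32-bit integer. A quick plain-Python check (illustrative only, not part of the polars-bio API):

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, i.e. roughly the 2 Gb limit mentioned above

def fits_int32(position: int) -> bool:
    """Return True if a genomic position can be encoded as int32."""
    return -2**31 <= position <= INT32_MAX

print(fits_int32(248_956_422))    # length of human chr1: True
print(fits_int32(3_000_000_000))  # a hypothetical 3 Gb coordinate: False
```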
Memory Optimization
For overlap() operations that produce very large result sets, use the low_memory=True parameter to reduce peak memory consumption:
This enables streaming output generation that caps the output batch size, trading some performance for significantly lower memory usage.
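A minimal sketch of the option in use (file names are illustrative):

```python
import polars_bio as pb

# Scanned inputs carry coordinate-system metadata automatically
df1 = pb.scan_vcf("a.vcf")
df2 = pb.scan_vcf("b.vcf")

# low_memory=True caps the output batch size during streaming generation,
# reducing peak memory at some cost in throughput
result = pb.overlap(df1, df2, low_memory=True)
```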
Coordinate systems support
polars-bio supports both 0-based half-open and 1-based closed coordinate systems for genomic ranges operations. By default, it uses 1-based closed coordinates, which is the native format for VCF, GFF, and SAM/BAM files.
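The two conventions describe the same physical interval and differ only in how the start is encoded. A plain-Python illustration of the conversion (not part of the polars-bio API):

```python
def one_based_to_zero_based(start: int, end: int) -> tuple[int, int]:
    """1-based closed [start, end] -> 0-based half-open [start-1, end)."""
    return start - 1, end

def zero_based_to_one_based(start: int, end: int) -> tuple[int, int]:
    """0-based half-open [start, end) -> 1-based closed [start+1, end]."""
    return start + 1, end

# The first 100 bases of a chromosome:
assert one_based_to_zero_based(1, 100) == (0, 100)   # VCF/GFF-style -> BED-style
assert zero_based_to_one_based(0, 100) == (1, 100)   # BED-style -> VCF/GFF-style
```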
How it works
The coordinate system is managed through DataFrame metadata that is set at I/O time and read by range operations. This ensures consistency throughout your analysis pipeline.
```mermaid
flowchart TB
    subgraph IO["I/O Layer"]
        scan["scan_vcf/gff/bam/cram/bed()"]
        read["read_vcf/gff/bam/cram/bed()"]
    end
    subgraph Config["Session Configuration"]
        zero_based["datafusion.bio.coordinate_system_zero_based<br/>(default: false = 1-based)"]
        check["datafusion.bio.coordinate_system_check<br/>(default: false = lenient)"]
    end
    subgraph DF["DataFrame with Metadata"]
        polars_meta["Polars DataFrame/LazyFrame<br/>coordinate_system_zero_based"]
        pandas_meta["Pandas DataFrame<br/>df.attrs"]
    end
    subgraph RangeOps["Range Operations"]
        overlap["overlap()"]
        nearest["nearest()"]
        count["count_overlaps()"]
        coverage["coverage()"]
        merge["merge()"]
    end
    subgraph Validation["Metadata Validation"]
        validate["validate_coordinate_systems()"]
        error1["MissingCoordinateSystemError"]
        error2["CoordinateSystemMismatchError"]
        fallback["Fallback to global config<br/>+ emit warning"]
    end
    scan --> |"sets metadata"| polars_meta
    read --> |"sets metadata"| polars_meta
    zero_based --> |"use_zero_based param<br/>or default"| scan
    zero_based --> |"use_zero_based param<br/>or default"| read
    polars_meta --> overlap
    polars_meta --> nearest
    polars_meta --> count
    polars_meta --> coverage
    polars_meta --> merge
    pandas_meta --> overlap
    overlap --> validate
    nearest --> validate
    count --> validate
    coverage --> validate
    merge --> validate
    validate --> |"metadata missing"| check
    validate --> |"metadata mismatch"| error2
    check --> |"true (strict)"| error1
    check --> |"false (lenient)"| fallback
    fallback --> zero_based
```
Session parameters
polars-bio provides two session parameters to control coordinate system behavior:
| Parameter | Default | Description |
|---|---|---|
| `datafusion.bio.coordinate_system_zero_based` | `"false"` (1-based) | Default coordinate system for I/O operations when `use_zero_based` is not specified |
| `datafusion.bio.coordinate_system_check` | `"false"` (lenient) | Whether to raise an error when DataFrame metadata is missing |
import polars_bio as pb
# Check current settings
print(pb.get_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED)) # "false"
print(pb.get_option(pb.POLARS_BIO_COORDINATE_SYSTEM_CHECK)) # "false"
# Change to 0-based coordinates globally
pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)
Reading files with coordinate system metadata
When you read genomic files using polars-bio I/O functions, the coordinate system metadata is automatically set on the returned DataFrame:
import polars_bio as pb
# Default: 1-based coordinates (use_zero_based=False)
df = pb.scan_vcf("variants.vcf")
# Metadata is automatically set: coordinate_system_zero_based=False
# Explicit 0-based coordinates
df_zero = pb.scan_bed("regions.bed", use_zero_based=True)
# Metadata is automatically set: coordinate_system_zero_based=True
# Range operations read coordinate system from metadata
result = pb.overlap(df, df_zero, ...) # Raises CoordinateSystemMismatchError!
Setting metadata on DataFrames
For DataFrames not created via polars-bio I/O functions, you must set the coordinate system metadata manually:
import polars as pl
# Create a DataFrame
df = pl.DataFrame({
"chrom": ["chr1", "chr1"],
"start": [100, 200],
"end": [150, 250]
}).lazy()
# Set coordinate system metadata (requires polars-config-meta)
df = df.config_meta.set(coordinate_system_zero_based=False) # 1-based
# Now it can be used with range operations
result = pb.overlap(df, other_df, ...)
import pandas as pd
# Create a DataFrame
pdf = pd.DataFrame({
"chrom": ["chr1", "chr1"],
"start": [100, 200],
"end": [150, 250]
})
# Set coordinate system metadata via df.attrs
pdf.attrs["coordinate_system_zero_based"] = False # 1-based
# Now it can be used with range operations
result = pb.overlap(pdf, other_df, output_type="pandas.DataFrame", ...)
Error handling
polars-bio raises specific errors to prevent coordinate system mismatches:
MissingCoordinateSystemError
Raised when a DataFrame lacks coordinate system metadata:
import polars as pl
import polars_bio as pb
# DataFrame without metadata
df = pl.DataFrame({"chrom": ["chr1"], "start": [100], "end": [200]}).lazy()
# This raises MissingCoordinateSystemError
pb.overlap(df, other_df, ...)
How to fix: Set metadata on your DataFrame before passing it to range operations (see examples above).
CoordinateSystemMismatchError
Raised when two DataFrames have different coordinate systems:
import polars_bio as pb
# One DataFrame is 1-based, another is 0-based
df1 = pb.scan_vcf("file.vcf") # 1-based (default)
df2 = pb.scan_bed("file.bed", use_zero_based=True) # 0-based
# This raises CoordinateSystemMismatchError
pb.overlap(df1, df2, ...)
How to fix: Ensure both DataFrames use the same coordinate system.
Default behavior (lenient validation)
By default, polars-bio uses lenient validation (coordinate_system_check=false). When a DataFrame lacks coordinate system metadata, it falls back to the global configuration and emits a warning:
import polars as pl
import polars_bio as pb
# DataFrames without metadata will use the global config with a warning
df = pl.DataFrame({"chrom": ["chr1"], "start": [100], "end": [200]}).lazy()
result = pb.overlap(df, other_df, ...) # Uses global coordinate system setting
# Warning: Coordinate system metadata is missing. Using global config...
Strict mode
For production pipelines where coordinate system consistency is critical, you can enable strict validation:
import polars_bio as pb
# Enable strict coordinate system check
pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_CHECK, True)
# Now DataFrames without metadata will raise MissingCoordinateSystemError
Tip
Enable strict mode in production pipelines to catch coordinate system mismatches early and prevent incorrect results.
Migration from previous versions
If you're upgrading from a previous version of polars-bio:
- Range operations no longer accept the `use_zero_based` parameter; the coordinate system is read from DataFrame metadata
- I/O functions use the `use_zero_based` parameter (renamed from `one_based` with inverted logic)
- Pandas DataFrames require explicit metadata: set `df.attrs["coordinate_system_zero_based"]` before range operations
# Before (old API)
result = pb.overlap(df1, df2, use_zero_based=True, ...)
# After (new API) - set metadata at I/O time or on DataFrames
df1 = pb.scan_vcf("file.vcf", use_zero_based=True)
df2 = pb.scan_bed("file.bed", use_zero_based=True)
result = pb.overlap(df1, df2, ...) # Reads from metadata
File Metadata
polars-bio automatically attaches comprehensive metadata to DataFrames when reading genomic files. This metadata includes format information, coordinate systems, and format-specific details like VCF header fields.
Metadata Structure
The metadata is stored in a clean, user-friendly structure:
import polars_bio as pb
lf = pb.scan_vcf("variants.vcf")
meta = pb.get_metadata(lf)
# Returns:
{
"format": "vcf", # File format
"path": "variants.vcf", # Source file path
"coordinate_system_zero_based": False, # Coordinate system (VCF is 1-based)
"header": {
"version": "VCFv4.2", # VCF version
"sample_names": ["Sample1", "Sample2"], # Sample names
"info_fields": { # INFO field definitions
"AF": {
"number": "A",
"type": "Float",
"description": "Allele Frequency",
"id": "AF"
}
},
"format_fields": { # FORMAT field definitions
"GT": {
"number": "1",
"type": "String",
"description": "Genotype"
}
},
"contigs": [...], # Contig definitions
"filters": [...], # Filter definitions
"_datafusion_table_name": "variants" # Internal table name (for debugging)
}
}
Accessing Metadata
polars-bio provides three main functions for working with metadata:
1. Get all metadata as a dictionary
import polars_bio as pb
lf = pb.scan_vcf("file.vcf")
meta = pb.get_metadata(lf)
# Access different parts
print(meta["format"]) # "vcf"
print(meta["path"]) # "file.vcf"
print(meta["coordinate_system_zero_based"]) # False (1-based)
# Access VCF-specific fields
print(meta["header"]["version"]) # "VCFv4.2"
print(meta["header"]["sample_names"]) # ["Sample1", "Sample2"]
# Access INFO field definitions
af_field = meta["header"]["info_fields"]["AF"]
print(af_field["type"]) # "Float"
print(af_field["description"]) # "Allele Frequency"
# Access FORMAT field definitions
gt_field = meta["header"]["format_fields"]["GT"]
print(gt_field["type"]) # "String"
2. Print metadata as formatted JSON
import polars_bio as pb
lf = pb.scan_vcf("file.vcf")
# Print as pretty JSON
pb.print_metadata_json(lf)
# Customize indentation
pb.print_metadata_json(lf, indent=4)
3. Print human-readable summary
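Call `print_metadata_summary()` on the scanned frame:

```python
import polars_bio as pb

lf = pb.scan_vcf("file.vcf")

# Print a human-readable summary of the attached metadata
pb.print_metadata_summary(lf)
```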
Output:
======================================================================
Metadata Summary
======================================================================
Format: vcf
Path: file.vcf
Coordinate System: 1-based
Format-specific metadata:
----------------------------------------------------------------------
VCF Version: VCFv4.2
Samples (3): Sample1, Sample2, Sample3
INFO fields: 5
- AF: Float (Allele Frequency)
- DP: Integer (Total Depth)
- AC: Integer (Allele Count)
FORMAT fields: 3
- GT: String (Genotype)
- DP: Integer (Read Depth)
- GQ: Integer (Genotype Quality)
======================================================================
Format-Specific Metadata
Different file formats include different metadata:
lf = pb.scan_vcf("variants.vcf")
meta = pb.get_metadata(lf)
# VCF header metadata
meta["header"]["version"] # VCF version
meta["header"]["sample_names"] # Sample names
meta["header"]["info_fields"] # INFO field definitions
meta["header"]["format_fields"] # FORMAT field definitions
meta["header"]["contigs"] # Contig definitions
meta["header"]["filters"] # Filter definitions
Setting Custom Metadata
You can set metadata on DataFrames created from other sources:
import polars as pl
import polars_bio as pb
# Create a DataFrame
df = pl.DataFrame({
"chrom": ["chr1", "chr1"],
"start": [100, 200],
"end": [150, 250]
}).lazy()
# Set metadata
pb.set_source_metadata(
df,
format="bed",
path="custom.bed",
header={"description": "Custom intervals"}
)
# Now metadata is available
meta = pb.get_metadata(df)
print(meta["format"]) # "bed"
print(meta["header"]["description"]) # "Custom intervals"
Metadata Preservation
Metadata is preserved through Polars operations:
lf = pb.scan_vcf("variants.vcf")
# Metadata persists after operations
filtered = lf.filter(pl.col("qual") > 30)
selected = lf.select(["chrom", "start", "end"])
limited = lf.head(100)
# All have the same metadata
meta1 = pb.get_metadata(lf)
meta2 = pb.get_metadata(filtered)
meta3 = pb.get_metadata(selected)
assert meta1["format"] == meta2["format"] == meta3["format"] # All "vcf"
Using Metadata for Debugging
The _datafusion_table_name field is useful for debugging DataFusion SQL queries:
lf = pb.scan_vcf("variants.vcf")
meta = pb.get_metadata(lf)
# Get internal table name
table_name = meta["header"]["_datafusion_table_name"]
print(f"Table name: {table_name}") # "variants"
# Use it in SQL queries for debugging
result = pb.sql(f"SELECT COUNT(*) FROM {table_name}")
API Reference
| Function | Description |
|---|---|
| `get_metadata(df)` | Get all metadata as a dictionary |
| `print_metadata_json(df, indent=2)` | Print metadata as formatted JSON |
| `print_metadata_summary(df)` | Print human-readable metadata summary |
| `set_source_metadata(df, format, path, header)` | Set metadata on a DataFrame |
API comparison between libraries
There is no standard API for genomic ranges operations in Python. The table below compares the APIs of the libraries; it is not exhaustive and only shows the most common operations used in benchmarking.
| operation | Bioframe | polars-bio | PyRanges (v0) | PyRanges (v1) | Pybedtools | GenomicRanges |
|---|---|---|---|---|---|---|
| overlap | overlap | overlap | join¹ | join_ranges | intersect² | find_overlaps³ |
| nearest | closest | nearest | nearest | nearest | closest⁴ | nearest⁵ |
| read_table | read_table | read_table | read_bed | read_bed | BedTool | read_bed |
Note

1. There is an `overlap` method in PyRanges, but its output is limited to the indices of intervals from the other DataFrame that overlap. In Bioframe's benchmark the `join` method was likewise used instead of `overlap`.
2. The `wa` and `wb` options were used to obtain a comparable output.
3. The output contains only a list of the same length as the query, holding hits to overlapping indices, so a data transformation is required to obtain the same output as in the other libraries. Since performance was already far worse than in the more efficient libraries, this additional transformation was not included in the benchmark.
4. `s=first` was used to obtain a comparable output.
5. `select="arbitrary"` was used to obtain a comparable output.
File formats support
For each bioinformatic format, three methods are always available: `read_*` (eager), `scan_*` (lazy), and `register_*`, which can be used either to read a file into a Polars DataFrame/LazyFrame or to register it as a DataFusion table for further processing using SQL or the built-in interval methods. In either case, local or cloud storage files can be used as input. Please refer to the cloud storage section for more details.
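For example, the three access patterns for VCF look like this (the file path is illustrative):

```python
import polars_bio as pb

df = pb.read_vcf("variants.vcf")             # eager: Polars DataFrame
lf = pb.scan_vcf("variants.vcf")             # lazy: Polars LazyFrame
pb.register_vcf("variants.vcf", "variants")  # DataFusion table for SQL

result = pb.sql("SELECT * FROM variants LIMIT 5").collect()
```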
| Format | Single-threaded | Parallel (indexed) | Limit pushdown | Predicate pushdown | Projection pushdown |
|---|---|---|---|---|---|
| BED | β | β | β | ||
| VCF | |||||
| BAM | |||||
| CRAM | |||||
| FASTQ | β | β | |||
| FASTA | β | β | β | ||
| GFF3 | |||||
| Pairs |||||
Indexed reads & predicate pushdown
When an index file is present alongside the data file (BAI/CSI for BAM, CRAI for CRAM, TBI/CSI for VCF, GFF, and Pairs), polars-bio can push genomic region filters down to the DataFusion execution layer. This enables index-based random access: only the relevant genomic regions are read from disk, dramatically improving performance for selective queries on large files.
Index files are auto-discovered by convention. Predicate pushdown is enabled by default for BAM, CRAM, VCF, GFF, and Pairs formats; no extra configuration is needed.
Supported index formats
| Data Format | Index Formats | Naming Convention |
|---|---|---|
| BAM | BAI, CSI | sample.bam.bai or sample.bai, sample.bam.csi |
| CRAM | CRAI | sample.cram.crai |
| VCF (bgzf) | TBI, CSI | sample.vcf.gz.tbi, sample.vcf.gz.csi |
| GFF (bgzf) | TBI, CSI | sample.gff.gz.tbi, sample.gff.gz.csi |
| Pairs (bgzf) | TBI, CSI | contacts.pairs.gz.tbi, contacts.pairs.gz.csi |
| FASTQ (bgzf) | GZI | sample.fastq.bgz.gzi |
Usage with the scan/read API
Simply use `.filter()`: predicate pushdown is enabled by default for BAM, CRAM, VCF, GFF, and Pairs:
import polars as pl
import polars_bio as pb
# Single chromosome filter: only chr1 data is read from disk
df = (
pb.scan_bam("alignments.bam")
.filter(pl.col("chrom") == "chr1")
.collect()
)
# Multi-chromosome filter
df = (
pb.scan_vcf("variants.vcf.gz")
.filter(pl.col("chrom").is_in(["chr21", "chr22"]))
.collect()
)
# Region query: combines chromosome and coordinate filters
df = (
pb.scan_bam("alignments.bam")
.filter(
(pl.col("chrom") == "chr1")
& (pl.col("start") >= 10000)
& (pl.col("end") <= 50000)
)
.collect()
)
# CRAM with predicate pushdown
df = (
pb.scan_cram("alignments.cram")
.filter(pl.col("chrom") == "chr1")
.collect()
)
Tip
Predicate pushdown supports: equality (==), comparisons (>=, <=, >, <), is_in(), is_null(), is_not_null(), and combinations with & (AND). Complex predicates like .str.contains() or OR logic are automatically filtered client-side. To disable pushdown, pass predicate_pushdown=False.
Usage with the SQL API
The SQL path works automatically: DataFusion parses the WHERE clause and uses the index without any extra flags:
import polars_bio as pb
pb.register_bam("alignments.bam", "reads")
# Single chromosome
result = pb.sql("SELECT * FROM reads WHERE chrom = 'chr1'").collect()
# Region query
result = pb.sql(
"SELECT * FROM reads WHERE chrom = 'chr1' AND start >= 10000 AND \"end\" <= 50000"
).collect()
# Combined genomic and record filters
result = pb.sql(
"SELECT * FROM reads WHERE chrom = 'chr1' AND mapping_quality >= 30"
).collect()
Automatic parallel partitioning
When an index file is present, DataFusion distributes genomic regions across balanced partitions using index-derived size estimates, enabling parallel execution. Formats with known contig lengths (BAM, CRAM) can split large regions into sub-regions for full parallelism even on single-chromosome queries. For FASTQ files, a GZI index alongside a BGZF-compressed file enables parallel decoding of compressed blocks. This is controlled by the global `target_partitions` setting:
import polars_bio as pb
pb.set_option("datafusion.execution.target_partitions", "8")
df = pb.read_bam("large_file.bam") # 8 partitions will be used for parallel execution
df = pb.read_fastq("reads.fastq.bgz") # parallel BGZF decoding when .gzi index is present
Partitioning behavior (BAM, CRAM, VCF, GFF):
| Index Available? | SQL Filters | Partitions |
|---|---|---|
| Yes | `chrom = 'chr1' AND start >= 1000` | up to `target_partitions` (region split into sub-regions) |
| Yes | `chrom IN ('chr1', 'chr2')` | up to `target_partitions` (both regions split to fill bins) |
| Yes | `mapping_quality >= 30` (no genomic filter) | up to `target_partitions` (all chroms balanced + split) |
| Yes | None (full scan) | up to `target_partitions` (all chroms balanced + split) |
| No | Any | 1 (sequential full scan) |
Partitioning behavior (FASTQ):
| File type | GZI Index? | Partitions |
|---|---|---|
| BGZF (`.fastq.bgz`) | Yes (`.fastq.bgz.gzi`) | up to `target_partitions` (parallel block decoding) |
| BGZF (`.fastq.bgz`) | No | 1 (sequential read) |
| GZIP (`.fastq.gz`) | N/A | 1 (sequential; GZIP cannot be parallelized) |
| Uncompressed (`.fastq`) | N/A | up to `target_partitions` (byte-range parallel) |
Record-level filter pushdown
Beyond index-based region queries, all formats support record-level predicate evaluation. Filters on columns like mapping_quality, flag, score, or strand are evaluated as each record is read, filtering early before Arrow RecordBatch construction.
This works with or without an index file:
import polars as pl
import polars_bio as pb
# No index needed: filters applied per-record during scan
df = (
pb.scan_bam("alignments.bam")
.filter((pl.col("mapping_quality") >= 30) & (pl.col("flag") & 4 == 0))
.collect()
)
# Combine genomic region (uses index) with record filter (applied per-record)
df = (
pb.scan_bam("alignments.bam")
.filter(
(pl.col("chrom") == "chr1")
& (pl.col("start") >= 1000000)
& (pl.col("mapping_quality") >= 30)
)
.collect()
)
Projection pushdown
BAM, CRAM, VCF, and Pairs formats support parsing-level projection pushdown. When you select a subset of columns, unprojected fields are skipped entirely during record parsing: no string formatting, sequence decoding, map lookups, or memory allocation for those fields. This can significantly reduce I/O and CPU time, especially for wide schemas like BAM (11+ columns) where you only need a few fields.
Projection pushdown is enabled by default (projection_pushdown=True) on all scan_*/read_* calls and range operations. To disable it, pass projection_pushdown=False.
import polars_bio as pb
# Only name and chrom are parsed from each BAM record (projection pushdown is on by default)
df = (
pb.scan_bam("alignments.bam")
.select(["name", "chrom"])
.collect()
)
# Works the same for CRAM and VCF
df = pb.scan_cram("alignments.cram").select(["name", "chrom"]).collect()
# Works with SQL too: only referenced columns are parsed
pb.register_vcf("variants.vcf.gz", "variants")
result = pb.sql("SELECT chrom, start FROM variants").collect()
You can verify pushdown is active by inspecting the physical execution plan:
from polars_bio.context import ctx
from polars_bio.polars_bio import (
InputFormat, ReadOptions, BamReadOptions,
py_register_table, py_read_table,
)
read_options = ReadOptions(bam_read_options=BamReadOptions())
table = py_register_table(ctx, "alignments.bam", None, InputFormat.Bam, read_options)
df = py_read_table(ctx, table.name).select_columns("name", "chrom")
print(df.execution_plan())
# CooperativeExec
# BamExec: projection=[name, chrom] <-- only 2 of 11 columns parsed
Tip
COUNT(*) queries also benefit: when no columns are needed, the empty projection path avoids parsing any fields while still counting records correctly.
Index file generation
Creating index files
Create index files using standard bioinformatics tools:
# BAM: sort and index
samtools sort input.bam -o sorted.bam
samtools index sorted.bam # creates sorted.bam.bai
# CRAM: sort and index
samtools sort input.cram -o sorted.cram --reference ref.fa
samtools index sorted.cram # creates sorted.cram.crai
# VCF: sort, compress, and index
bcftools sort input.vcf -Oz -o sorted.vcf.gz
bcftools index -t sorted.vcf.gz # creates sorted.vcf.gz.tbi
# GFF: sort, compress, and index
(grep "^#" input.gff; grep -v "^#" input.gff | sort -k1,1 -k4,4n) | bgzip > sorted.gff.gz
tabix -p gff sorted.gff.gz # creates sorted.gff.gz.tbi
# Pairs: sort, compress, and index (col 2=chr1, col 3=pos1)
sort -k2,2 -k3,3n contacts.pairs | bgzip > contacts.pairs.gz
tabix -s 2 -b 3 -e 3 contacts.pairs.gz # creates contacts.pairs.gz.tbi
# FASTQ: BGZF compress and create GZI index for parallel reads
bgzip reads.fastq # creates reads.fastq.bgz
bgzip -r reads.fastq.bgz # creates reads.fastq.bgz.gzi
File Output
polars-bio supports writing DataFrames back to bioinformatic file formats. Two methods are available for each supported format:
- `write_*` - Eager write that collects the DataFrame and writes it to disk; returns the row count
- `sink_*` - Streaming write for LazyFrames that processes data in batches without full materialization
Output Format Support
| Format | write_* | sink_* | Compression | Notes |
|---|---|---|---|---|
| VCF | ✓ | ✓ | `.vcf.gz`, `.vcf.bgz` | Auto-detected from extension |
| BAM | ✓ | ✓ | BGZF (built-in) | Binary alignment format |
| SAM | ✓ | ✓ | None | Plain text alignment format |
| CRAM | ✓ | ✓ | Built-in | Requires reference FASTA |
| FASTQ | ✓ | ✓ | `.fastq.gz`, `.fastq.bgz` | Auto-detected from extension |
Basic Usage
import polars_bio as pb
# Read, transform, and write back
df = pb.read_bam("input.bam", tag_fields=["NM", "AS"])
filtered = df.filter(pl.col("mapping_quality") > 20)
pb.write_bam(filtered, "output.bam")
# Streaming write with LazyFrame
lf = pb.scan_vcf("variants.vcf")
pb.sink_vcf(lf.filter(pl.col("qual") > 30), "filtered.vcf.bgz")
Sorted Output with sort_on_write
BAM, SAM, and CRAM write functions support the sort_on_write parameter to produce coordinate-sorted output:
import polars_bio as pb
# Write coordinate-sorted BAM
df = pb.read_bam("unsorted.bam")
pb.write_bam(df, "sorted.bam", sort_on_write=True)
# Streaming sorted write
lf = pb.scan_sam("input.sam")
pb.sink_bam(lf, "sorted.bam", sort_on_write=True)
When `sort_on_write=True`:

- Records are sorted by `(chrom ASC, start ASC)` during write
- Output header contains `@HD ... SO:coordinate`

When `sort_on_write=False` (default):

- Records are written in input order
- Output header contains `@HD ... SO:unsorted`
CRAM Output
CRAM format requires a reference FASTA file for writing:
import polars_bio as pb
# CRAM write requires reference_path
df = pb.read_cram("input.cram", reference_path="reference.fa")
pb.write_cram(df, "output.cram", reference_path="reference.fa")
# Streaming CRAM write
lf = pb.scan_cram("input.cram", reference_path="reference.fa")
pb.sink_cram(lf, "output.cram", reference_path="reference.fa", sort_on_write=True)
Warning
The reference_path parameter is required for write_cram() and sink_cram(). Attempting to write CRAM without a reference will raise an error.
Output Compression
Output compression is auto-detected from the file extension for VCF and FASTQ formats:
| Extension | Compression |
|---|---|
| `.vcf` / `.fastq` | None (plain text) |
| `.vcf.gz` / `.fastq.gz` | GZIP |
| `.vcf.bgz` / `.fastq.bgz` | BGZF (block gzip) |
import polars_bio as pb
# VCF
df = pb.read_vcf("variants.vcf")
pb.write_vcf(df, "output.vcf") # plain text
pb.write_vcf(df, "output.vcf.gz") # GZIP
pb.write_vcf(df, "output.vcf.bgz") # BGZF (recommended for indexing)
# FASTQ
df = pb.read_fastq("reads.fastq")
pb.write_fastq(df, "output.fastq") # plain text
pb.write_fastq(df, "output.fastq.gz") # GZIP
pb.write_fastq(df, "output.fastq.bgz") # BGZF (recommended for parallel reads with GZI index)
# Streaming write
lf = pb.scan_fastq("large_reads.fastq.gz")
pb.sink_fastq(lf.limit(1000), "sample.fastq")
Header Preservation
When reading and writing alignment files (BAM/SAM/CRAM), polars-bio preserves header metadata including:
@SQ(sequence dictionary)@RG(read groups)@PG(program records)
This enables lossless roundtrip workflows:
import polars_bio as pb
# Read with full header preservation
df = pb.read_bam("input.bam")
# Filter records
filtered = df.filter(pl.col("mapping_quality") > 20)
# Write back - header metadata is preserved
pb.write_bam(filtered, "filtered.bam")
Polars Extension Methods
Write functions are also available as Polars namespace extensions:
import polars_bio as pb
# DataFrame extensions
df = pb.read_bam("input.bam")
df.pb.write_bam("output.bam", sort_on_write=True)
df.pb.write_sam("output.sam")
df.pb.write_cram("output.cram", reference_path="ref.fa")
df.pb.write_vcf("output.vcf.bgz")
df.pb.write_fastq("output.fastq.gz")
# LazyFrame extensions
lf = pb.scan_bam("input.bam")
lf.pb.sink_bam("output.bam", sort_on_write=True)
lf.pb.sink_sam("output.sam")
lf.pb.sink_cram("output.cram", reference_path="ref.fa")
lf.pb.sink_vcf("output.vcf.bgz")
lf.pb.sink_fastq("output.fastq.bgz")
SQL-powered data processing
polars-bio provides a SQL-like API for bioinformatic data querying or manipulation. Check SQL reference for more details.
import polars_bio as pb
pb.register_vcf("gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_sv", info_fields=["SVTYPE", "SVLEN"])
pb.sql("SELECT * FROM gnomad_sv WHERE SVTYPE = 'DEL' AND SVLEN > 1000").limit(3).collect()
Accessing registered tables
You can access registered tables programmatically using the ctx.table() method, which returns a DataFusion DataFrame:
import polars_bio as pb
from polars_bio.context import ctx
# Register a file as a table
pb.register_vcf("variants.vcf", name="my_variants")
# Get the table as a DataFusion DataFrame
df = ctx.table("my_variants")
# Access the Arrow schema (includes coordinate system metadata)
schema = df.schema()
print(schema.metadata) # {b'bio.coordinate_system_zero_based': b'false'}
# Execute queries on the DataFrame
result = df.filter(df["chrom"] == "chr1").collect()
Tip
The ctx.table() method is useful for:
- Accessing Arrow schema metadata (including coordinate system information)
- Using the DataFusion DataFrame API directly
- Integrating with other DataFusion-based tools
Schema Inspection
Quickly inspect BAM/CRAM file schemas without reading the entire file:
import polars_bio as pb
# Get schema information for BAM file
schema = pb.describe_bam("file.bam")
print(schema)
# shape: (11, 2)
# ┌────────┬──────────┐
# │ column ┆ datatype │
# │ ---    ┆ ---      │
# │ str    ┆ str      │
# ╞════════╪══════════╡
# │ name   ┆ String   │
# │ chrom  ┆ String   │
# │ start  ┆ UInt32   │
# ...
# Include tag columns in schema
schema = pb.describe_bam("file.bam", tag_fields=["NM", "AS", "MD"])
print(schema) # Shows 14 columns including tags
# CRAM schema
schema = pb.describe_cram("file.cram")
BAM Optional Tags
polars-bio supports reading BAM optional alignment tags as individual columns. Tags are only parsed when explicitly requested, ensuring zero overhead for standard reads.
Note: CRAM tag support is planned for a future release. The `tag_fields` parameter is accepted for CRAM functions but currently ignored with a warning.
Usage
import polars_bio as pb
# Read BAM with specific tags
df = pb.read_bam(
"alignments.bam",
tag_fields=["NM", "AS", "MD"] # Edit distance, alignment score, mismatch string
)
# Tags appear as regular columns
print(df.select(["name", "chrom", "NM", "AS"]))
# Lazy scan with tag filtering
lf = pb.scan_bam("alignments.bam", tag_fields=["NM", "AS"])
high_quality = lf.filter((pl.col("NM") <= 2) & (pl.col("AS") >= 100)).collect()
# SQL queries (tags must be quoted)
pb.register_bam("alignments.bam", "reads", tag_fields=["NM", "RG"])
result = pb.sql('SELECT name, "NM" FROM reads WHERE "NM" <= 2').collect()
Common Tags
- NM (Int32): Edit distance to reference
- MD (Utf8): Mismatch positions string
- AS (Int32): Alignment score
- XS (Int32): Secondary alignment score
- RG (Utf8): Read group identifier
- CB (Utf8): Cell barcode (single-cell)
- UB (Utf8): UMI barcode (single-cell)
Full registry includes ~40 common SAM tags.
Performance
- Zero overhead when `tag_fields=None` (default)
- Projection pushdown: only selected tags are parsed
- Tags parsed once per batch, not per record
Sample output of the gnomAD structural-variant query:

shape: (3, 10)
┌───────┬───────┬───────┬────────────────────────────────┬───┬───────┬────────────┬────────┬───────┐
│ chrom ┆ start ┆ end   ┆ id                             ┆ … ┆ qual  ┆ filter     ┆ svtype ┆ svlen │
│ ---   ┆ ---   ┆ ---   ┆ ---                            ┆   ┆ ---   ┆ ---        ┆ ---    ┆ ---   │
│ str   ┆ u32   ┆ u32   ┆ str                            ┆   ┆ f64   ┆ str        ┆ str    ┆ i32   │
╞═══════╪═══════╪═══════╪════════════════════════════════╪═══╪═══════╪════════════╪════════╪═══════╡
│ chr1  ┆ 22000 ┆ 30000 ┆ gnomAD-SV_v3_DEL_chr1_fa103016 ┆ … ┆ 999.0 ┆ HIGH_NCR   ┆ DEL    ┆ 8000  │
│ chr1  ┆ 40000 ┆ 47000 ┆ gnomAD-SV_v3_DEL_chr1_b26f63f7 ┆ … ┆ 145.0 ┆ PASS       ┆ DEL    ┆ 7000  │
│ chr1  ┆ 79086 ┆ 88118 ┆ gnomAD-SV_v3_DEL_chr1_733c4ef0 ┆ … ┆ 344.0 ┆ UNRESOLVED ┆ DEL    ┆ 9032  │
└───────┴───────┴───────┴────────────────────────────────┴───┴───────┴────────────┴────────┴───────┘
You can use the view mechanism to create a virtual table from a DataFrame that contains preprocessing steps and reuse it in multiple steps. To avoid materializing the intermediate results in memory, you can run your processing in streaming mode.
Parallel engine
It is straightforward to parallelize operations in polars-bio. The library is built on top of Apache DataFusion, so you can set the degree of parallelism using the `datafusion.execution.target_partitions` option, e.g.:
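Setting it via `pb.set_option`:

```python
import polars_bio as pb

# Global session setting: use 8 partitions for parallel execution
pb.set_option("datafusion.execution.target_partitions", "8")
```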
Tip
- The default value is 1 (parallel execution disabled).
- The `datafusion.execution.target_partitions` option is a global setting and affects all operations in the current session.
- Check available strategies for optimal performance.
- See the other configuration settings in the Apache DataFusion documentation.
Cloud storage
polars-bio supports direct streamed reading from cloud storage (e.g. S3, GCS), enabling processing of large-scale genomics data without materializing it in memory. It is built upon the OpenDAL project, a unified data access layer for cloud storage, which allows reading bioinformatic file formats from various cloud storage providers. For Apache DataFusion native file formats, such as Parquet or CSV, please refer to the DataFusion user guide.
Example
import polars_bio as pb
## Register VCF files from Google Cloud Storage that will be streamed - no need to download them to the local disk, size ~0.8TB
pb.register_vcf("gs://gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/genomes/gnomad.genomes.r2.1.1.sites.liftover_grch38.vcf.bgz", "gnomad_big", allow_anonymous=True)
pb.register_vcf("gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_sv", allow_anonymous=True)
pb.overlap("gnomad_sv", "gnomad_big", streaming=True).sink_parquet("/tmp/overlap.parquet")
Tip
If you access cloud storage with authentication, please make sure the allow_anonymous parameter is set to False in the read/describe/register_table functions.
Supported features
| Feature | AWS S3 | Google Cloud Storage | Azure Blob Storage |
|---|---|---|---|
| Anonymous access | |||
| Authenticated access | |||
| Requester Pays | |||
| Concurrent requests1 | |||
| Streaming reads |
Note
1 For more information on concurrent requests and block size tuning, please refer to the issue.
AWS S3 configuration
Supported environment variables:
| Variable | Description |
|---|---|
| AWS_ACCESS_KEY_ID | AWS access key ID for authenticated access to S3. |
| AWS_SECRET_ACCESS_KEY | AWS secret access key for authenticated access to S3. |
| AWS_ENDPOINT_URL | Custom S3 endpoint URL for accessing S3-compatible storage. |
| AWS_REGION or AWS_DEFAULT_REGION | AWS region for accessing S3. |
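For example, the variables above can be exported before starting Python to target an S3-compatible object store; every value below is a placeholder:

```shell
# Placeholder credentials and endpoint for an S3-compatible store (e.g. MinIO)
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_ENDPOINT_URL="http://localhost:9000"
export AWS_REGION="us-east-1"
```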
Google Cloud Storage configuration
Supported environment variables:
| Variable | Description |
|---|---|
| GOOGLE_APPLICATION_CREDENTIALS | Path to the Google Cloud service account key file for authenticated access to GCS. |
Azure Blob Storage configuration
Supported environment variables:
| Variable | Description |
|---|---|
| AZURE_STORAGE_ACCOUNT | Azure Storage account name for authenticated access to Azure Blob Storage. |
| AZURE_STORAGE_KEY | Azure Storage account key for authenticated access to Azure Blob Storage. |
| AZURE_ENDPOINT_URL | Azure Blob Storage endpoint URL for accessing Azure Blob Storage. |
Streaming
polars-bio supports out-of-core processing with Apache DataFusion async streams and the Polars LazyFrame streaming option. This can bring a significant speedup as well as a reduction in memory usage, allowing you to process large datasets that do not fit in memory. See our benchmark results. There are two ways of using streaming mode:
- By setting the output_type to datafusion.DataFrame and using the Python DataFrame API, including methods such as count, write_parquet, write_csv or write_json. In this option you completely bypass the Polars streaming engine.

import polars_bio as pb
import polars as pl
pb.overlap("/tmp/gnomad.v4.1.sv.sites.parquet", "/tmp/gnomad.exomes.v4.1.sites.chr1.parquet", output_type="datafusion.DataFrame").write_parquet("/tmp/overlap.parquet")
pl.scan_parquet("/tmp/overlap.parquet").collect().count()

shape: (1, 6)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ chrom_1    ┆ start_1    ┆ end_1      ┆ chrom_2    ┆ start_2    ┆ end_2      │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│ u32        ┆ u32        ┆ u32        ┆ u32        ┆ u32        ┆ u32        │
╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╡
│ 2629727337 ┆ 2629727337 ┆ 2629727337 ┆ 2629727337 ┆ 2629727337 ┆ 2629727337 │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘

Tip
If you only need to write the results as fast as possible into one of the above file formats or quickly get the row count, then in most cases this is the best option.
- Using the new Polars streaming engine:

import os
import polars_bio as pb
os.environ['BENCH_DATA_ROOT'] = "/Users/mwiewior/research/data/databio"
os.environ['POLARS_VERBOSE'] = "1"
cols = ["contig", "pos_start", "pos_end"]
BENCH_DATA_ROOT = os.getenv('BENCH_DATA_ROOT', '/data/bench_data/databio')
df_1 = f"{BENCH_DATA_ROOT}/exons/*.parquet"
df_2 = f"{BENCH_DATA_ROOT}/exons/*.parquet"
pb.overlap(df_1, df_2, cols1=cols, cols2=cols).collect(engine="streaming").limit()

1652814rows [00:00, 20208793.67rows/s]
[MultiScanState]: Readers disconnected
polars-stream: done running graph phase
polars-stream: updating graph state
shape: (5, 6)
┌──────────┬─────────────┬───────────┬──────────┬─────────────┬───────────┐
│ contig_1 ┆ pos_start_1 ┆ pos_end_1 ┆ contig_2 ┆ pos_start_2 ┆ pos_end_2 │
│ ---      ┆ ---         ┆ ---       ┆ ---      ┆ ---         ┆ ---       │
│ str      ┆ i32         ┆ i32       ┆ str      ┆ i32         ┆ i32       │
╞══════════╪═════════════╪═══════════╪══════════╪═════════════╪═══════════╡
│ chr1     ┆ 11873       ┆ 12227     ┆ chr1     ┆ 11873       ┆ 12227     │
│ chr1     ┆ 12612       ┆ 12721     ┆ chr1     ┆ 12612       ┆ 12721     │
│ chr1     ┆ 13220       ┆ 14409     ┆ chr1     ┆ 13220       ┆ 14409     │
│ chr1     ┆ 13220       ┆ 14409     ┆ chr1     ┆ 14361       ┆ 14829     │
│ chr1     ┆ 14361       ┆ 14829     ┆ chr1     ┆ 13220       ┆ 14409     │
└──────────┴─────────────┴───────────┴──────────┴─────────────┴───────────┘
Parallelism can be controlled using the datafusion.execution.target_partitions option as described in the parallel engine section (compare the rows/s metric in the following examples).
pb.set_option("datafusion.execution.target_partitions", "1")
pb.overlap(df_1, df_2, cols1=cols, cols2=cols).collect(engine="streaming").count()
1652814rows [00:00, 19664163.99rows/s]
shape: (1, 6)
┌──────────┬─────────────┬───────────┬──────────┬─────────────┬───────────┐
│ contig_1 ┆ pos_start_1 ┆ pos_end_1 ┆ contig_2 ┆ pos_start_2 ┆ pos_end_2 │
│ ---      ┆ ---         ┆ ---       ┆ ---      ┆ ---         ┆ ---       │
│ u32      ┆ u32         ┆ u32       ┆ u32      ┆ u32         ┆ u32       │
╞══════════╪═════════════╪═══════════╪══════════╪═════════════╪═══════════╡
│ 1652814  ┆ 1652814     ┆ 1652814   ┆ 1652814  ┆ 1652814     ┆ 1652814   │
└──────────┴─────────────┴───────────┴──────────┴─────────────┴───────────┘
pb.set_option("datafusion.execution.target_partitions", "2")
pb.overlap(df_1, df_2, cols1=cols, cols2=cols).collect(engine="streaming").count()
1652814rows [00:00, 27841987.75rows/s]
shape: (1, 6)
┌──────────┬─────────────┬───────────┬──────────┬─────────────┬───────────┐
│ contig_1 ┆ pos_start_1 ┆ pos_end_1 ┆ contig_2 ┆ pos_start_2 ┆ pos_end_2 │
│ ---      ┆ ---         ┆ ---       ┆ ---      ┆ ---         ┆ ---       │
│ u32      ┆ u32         ┆ u32       ┆ u32      ┆ u32         ┆ u32       │
╞══════════╪═════════════╪═══════════╪══════════╪═════════════╪═══════════╡
│ 1652814  ┆ 1652814     ┆ 1652814   ┆ 1652814  ┆ 1652814     ┆ 1652814   │
└──────────┴─────────────┴───────────┴──────────┴─────────────┴───────────┘
Compression
polars-bio supports GZIP (default file extension *.gz) and Block GZIP (BGZIP, default file extension *.bgz) when reading files from local and cloud storage.
For BGZIP-compressed FASTQ files, parallel decoding of compressed blocks is automatic; see Automatic parallel partitioning and Index file generation for details. Please take a look at the following GitHub discussion.
DataFrames support
| I/O | Bioframe | polars-bio | PyRanges | Pybedtools | PyGenomics | GenomicRanges |
|---|---|---|---|---|---|---|
| Pandas DataFrame | ||||||
| Polars DataFrame | ||||||
| Polars LazyFrame | ||||||
| Native readers |
Polars Integration
polars-bio leverages deep integration with Polars through the Arrow C Data Interface, enabling high-performance zero-copy data exchange between Polars LazyFrames and the Rust-based genomic range operations engine.
Architecture
flowchart LR
subgraph Python["Python Layer"]
LF["Polars LazyFrame"]
DF["Polars DataFrame"]
end
subgraph FFI["Arrow C Data Interface"]
stream["Arrow C Stream<br/>(__arrow_c_stream__)"]
end
subgraph Rust["Rust Layer (polars-bio)"]
reader["ArrowArrayStreamReader"]
datafusion["DataFusion Engine"]
range_ops["Range Operations<br/>(overlap, nearest, etc.)"]
end
LF --> |"ArrowStreamExportable"| stream
DF --> |"to_arrow()"| stream
stream --> |"Zero-copy FFI"| reader
reader --> datafusion
datafusion --> range_ops
How It Works
When you pass a Polars LazyFrame to range operations like overlap() or nearest():
- Stream Export: The LazyFrame exports itself as an Arrow C Stream via collect_batches(lazy=True)._inner.__arrow_c_stream__() (Polars >= 1.37.0)
- Zero-Copy Transfer: The stream pointer is passed directly to Rust - no data copying or Python object conversion
- GIL-Free Execution: Once the stream is exported, all data processing happens in Rust without holding Python's GIL
- Streaming Execution: Data flows through DataFusion's streaming engine, processing batches on-demand
Performance Benefits
| Aspect | Previous Approach | Arrow C Stream |
|---|---|---|
| GIL acquisition | Per batch | Once at export |
| Data conversion | Polars → PyArrow → Arrow | Direct FFI |
| Memory overhead | Python iterator objects | None |
| Batch processing | Python __next__() calls | Native Rust iteration |
Requirements
- Polars >= 1.37.0 (required for ArrowStreamExportable)
Batch Size Configuration
polars-bio automatically synchronizes the batch size between Polars streaming and DataFusion execution. When you set datafusion.execution.batch_size, Polars' collect_batches() will use the same chunk size:
import polars_bio as pb
# Set batch size for both Polars and DataFusion
pb.set_option("datafusion.execution.batch_size", "8192")
# Now LazyFrame streaming uses 8192-row batches
# This ensures consistent memory usage and processing patterns
| Setting | Effect |
|---|---|
| datafusion.execution.batch_size | Controls batch size for both Polars streaming export and DataFusion processing |
| Default | 8192 rows (synchronized between Polars and DataFusion) |
| "65536" | Larger batches for high-throughput scenarios |
Tip
Matching batch sizes between Polars and DataFusion improves cache locality and reduces memory fragmentation when processing large datasets.
Example
import polars as pl
import polars_bio as pb
# Create a LazyFrame from a large file
lf1 = pl.scan_parquet("variants.parquet")
lf2 = pl.scan_parquet("regions.parquet")
# Set coordinate system metadata
lf1 = lf1.config_meta.set(coordinate_system_zero_based=True)
lf2 = lf2.config_meta.set(coordinate_system_zero_based=True)
# Range operation uses Arrow C Stream for efficient data transfer
result = pb.overlap(
lf1, lf2,
cols1=["chrom", "start", "end"],
cols2=["chrom", "start", "end"],
output_type="polars.LazyFrame"
)
# Execute with Polars streaming engine
result.collect(engine="streaming")