
🔨 Features

Genomic ranges operations

Features Bioframe polars-bio PyRanges Pybedtools PyGenomics GenomicRanges
overlap ✅ ✅ ✅ ✅ ✅ ✅
nearest ✅ ✅ ✅ ✅ ✅
count_overlaps ✅ ✅ ✅ ✅ ✅ ✅
cluster ✅ ✅ ✅
merge ✅ ✅ ✅ ✅ ✅
complement ✅ 🚧 ✅ ✅
coverage ✅ ✅ ✅ ✅ ✅
expand ✅ ✅ ✅ ✅ ✅
sort ✅ ✅ ✅ ✅ ✅
read_table ✅ ✅ ✅ ✅ ✅

Coordinate systems support

polars-bio supports both 0-based and 1-based coordinate systems. Check the overlap_filter parameter of a given operation (e.g. the overlap operation) to choose the appropriate coordinate system.
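To illustrate why the coordinate convention matters, here is a minimal plain-Python sketch. It does not call polars-bio and is not its implementation; the `overlaps` helper and its `strict` flag are hypothetical names used only for this example. It assumes 0-based intervals are half-open and 1-based intervals are closed, so the boundary comparison has to differ between the two.

```python
def overlaps(start1, end1, start2, end2, strict):
    """Return True if the two intervals overlap.

    strict=True  compares with '<'  (suits 0-based, half-open [start, end))
    strict=False compares with '<=' (suits 1-based, closed [start, end])
    This helper is hypothetical and only illustrates the idea.
    """
    if strict:
        return start1 < end2 and start2 < end1
    return start1 <= end2 and start2 <= end1


# The same two adjacent features written in both conventions:
# 0-based half-open: [100, 200) and [200, 300) share no base.
# 1-based closed:    [101, 200] and [201, 300] share no base either.
print(overlaps(100, 200, 200, 300, strict=True))   # False
print(overlaps(101, 200, 201, 300, strict=False))  # False

# Applying the closed-interval test to 0-based data reports a spurious overlap.
print(overlaps(100, 200, 200, 300, strict=False))  # True
```

The last call shows the typical off-by-one error that results from mixing conventions, which is why the filter should match the coordinate system of the input data.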

API comparison between libraries

There is no standard API for genomic ranges operations in Python. This table compares the API of the libraries. The table is not exhaustive and only shows the most common operations used in benchmarking.

| operation | Bioframe | polars-bio | PyRanges0 | PyRanges1 | Pybedtools | GenomicRanges |
|---|---|---|---|---|---|---|
| overlap | overlap | overlap | join [1] | join_ranges | intersect [2] | find_overlaps [3] |
| nearest | closest | nearest | nearest | nearest | closest [4] | nearest [5] |
| read_table | read_table | read_table | read_bed | read_bed | BedTool | read_bed |

Note

  1. There is an overlap method in PyRanges, but its output is limited to the indices of intervals from the other DataFrame that overlap. Bioframe's benchmark also used the join method instead of overlap.
  2. The wa and wb options were used to obtain a comparable output.
  3. The output contains only a list of the same length as the query, holding the indices of overlapping hits. Data transformation is required to obtain the same output as in other libraries. Since the performance was far worse than in more efficient libraries anyway, this additional data transformation was not included in the benchmark.
  4. s=first was used to obtain a comparable output.
  5. select="arbitrary" was used to obtain a comparable output.
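The data transformation mentioned in note 3 can be sketched in plain Python. This is only an illustration under assumptions: the `hits` layout (one list of subject indices per query interval) is inferred from the note, the sample intervals are made up, and `hits_to_pairs` is a hypothetical helper, not part of any of the libraries compared here.

```python
def hits_to_pairs(queries, subjects, hits):
    """Expand per-query hit lists into (query, subject) interval pairs.

    hits[i] holds the indices into `subjects` that overlap queries[i],
    so len(hits) == len(queries).
    """
    return [
        (queries[i], subjects[j])
        for i, subject_indices in enumerate(hits)
        for j in subject_indices
    ]


# Made-up example data: (chrom, start, end) tuples.
queries = [("chr1", 100, 200), ("chr1", 500, 600)]
subjects = [("chr1", 150, 250), ("chr1", 180, 220), ("chr1", 900, 950)]
hits = [[0, 1], []]  # assumed shape of the index-only output

pairs = hits_to_pairs(queries, subjects, hits)
for query, subject in pairs:
    print(query, subject)
```

Each resulting pair corresponds to one row of the joined output that the other libraries return directly.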

File formats support

| Format | Support level |
|---|---|
| BED | ✅ |
| VCF | ✅ |
| BAM | ✅ |
| FASTQ | ✅ |
| FASTA | ✅ |
| GFF | 🚧 |
| GTF | 🚧 |
| Indexed VCF | 🚧 |
| Indexed BAM | 🚧 |

SQL-powered data processing

polars-bio provides a SQL-like API for querying and manipulating bioinformatic data. See the SQL reference for more details.

import polars_bio as pb
pb.register_vcf("gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_sv", thread_num=1, info_fields=["SVTYPE", "SVLEN"])
pb.sql("SELECT * FROM gnomad_sv WHERE SVTYPE = 'DEL' AND SVLEN > 1000").limit(3).collect()

You can use the view mechanism to create a virtual table from a DataFrame that contains preprocessing steps and reuse it in multiple steps. To avoid materializing intermediate results in memory, you can run your processing in streaming mode.

```shell
shape: (3, 10)
┌───────┬───────┬───────┬────────────────────────────────┬───┬───────┬────────────┬────────┬───────┐
│ chrom ┆ start ┆ end   ┆ id                             ┆ … ┆ qual  ┆ filter     ┆ svtype ┆ svlen │
│ ---   ┆ ---   ┆ ---   ┆ ---                            ┆   ┆ ---   ┆ ---        ┆ ---    ┆ ---   │
│ str   ┆ u32   ┆ u32   ┆ str                            ┆   ┆ f64   ┆ str        ┆ str    ┆ i32   │
╞═══════╪═══════╪═══════╪════════════════════════════════╪═══╪═══════╪════════════╪════════╪═══════╡
│ chr1  ┆ 22000 ┆ 30000 ┆ gnomAD-SV_v3_DEL_chr1_fa103016 ┆ … ┆ 999.0 ┆ HIGH_NCR   ┆ DEL    ┆ 8000  │
│ chr1  ┆ 40000 ┆ 47000 ┆ gnomAD-SV_v3_DEL_chr1_b26f63f7 ┆ … ┆ 145.0 ┆ PASS       ┆ DEL    ┆ 7000  │
│ chr1  ┆ 79086 ┆ 88118 ┆ gnomAD-SV_v3_DEL_chr1_733c4ef0 ┆ … ┆ 344.0 ┆ UNRESOLVED ┆ DEL    ┆ 9032  │
└───────┴───────┴───────┴────────────────────────────────┴───┴───────┴────────────┴────────┴───────┘
```

Parallel engine 🏎️

It is straightforward to parallelize operations in polars-bio. The library is built on top of Apache DataFusion, so you can set the degree of parallelism using the datafusion.execution.target_partitions option, e.g.:

import polars_bio as pb
pb.set_option("datafusion.execution.target_partitions", "8")

Tip

  1. The default value is 1 (parallel execution disabled).
  2. The datafusion.execution.target_partitions option is a global setting and affects all operations in the current session.
  3. Check available strategies for optimal performance.
  4. See the other configuration settings in the Apache DataFusion documentation.
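Instead of hard-coding the partition count, it can be derived from the machine's CPU count. The following is a minimal sketch, not a recommendation from the polars-bio authors: `target_partitions_value` is a hypothetical helper, and leaving one core free is an arbitrary choice. The value is returned as a string because the example above passes the option as `"8"`.

```python
import os


def target_partitions_value(reserve=1):
    """Return a partition count as a string, leaving `reserve` cores free.

    Hypothetical helper for illustration; never returns less than "1".
    """
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return str(max(1, cores - reserve))


value = target_partitions_value()
print(value)

# With polars-bio installed, the value could then be applied with the
# set_option call shown earlier in this section:
# pb.set_option("datafusion.execution.target_partitions", value)
```

Because the option is global for the session, it is usually enough to set it once before running any operation.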

Cloud storage ☁️

polars-bio supports direct streamed reading from cloud storage (e.g. S3, GCS), enabling processing of large-scale genomics data without materializing it in memory.

import polars_bio as pb
# Register VCF files from Google Cloud Storage to be streamed; there is no need to download them (~0.8 TB) to the local disk
pb.register_vcf("gs://gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/genomes/gnomad.genomes.r2.1.1.sites.liftover_grch38.vcf.bgz", "gnomad_big")
pb.register_vcf("gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_sv")
pb.overlap("gnomad_sv", "gnomad_big", streaming=True).sink_parquet("/tmp/overlap.parquet")
Reading from cloud storage is especially useful when combined with SQL support for preprocessing and with the streaming processing capabilities.

Streaming 🚂

polars-bio supports out-of-core processing with Apache DataFusion async streams and the Polars LazyFrame streaming option. It can bring a significant speedup as well as a reduction in memory usage, making it possible to process large datasets that do not fit in memory. See our benchmark results. There are two ways of using streaming mode:

  1. By setting the output_type to datafusion.DataFrame and using the Python DataFrame API, including methods such as count, write_parquet, write_csv or write_json. This option bypasses the Polars streaming engine completely.

    import polars_bio as pb
    import polars as pl
    pb.overlap("/tmp/gnomad.v4.1.sv.sites.parquet", "/tmp/gnomad.exomes.v4.1.sites.chr1.parquet", output_type="datafusion.DataFrame").write_parquet("/tmp/overlap.parquet")
    >>> pl.scan_parquet("/tmp/overlap.parquet").collect().count()
    shape: (1, 6)
    ┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
    │ chrom_1    ┆ start_1    ┆ end_1      ┆ chrom_2    ┆ start_2    ┆ end_2      │
    │ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
    │ u32        ┆ u32        ┆ u32        ┆ u32        ┆ u32        ┆ u32        │
    ╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╡
    │ 2629727337 ┆ 2629727337 ┆ 2629727337 ┆ 2629727337 ┆ 2629727337 ┆ 2629727337 │
    └────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
    

    Tip

    If you only need to write the results as fast as possible into one of the above file formats, or to quickly get the row count, this is in most cases the best option.

  2. With the streaming option of the operation, which uses Polars streaming (experimental; check the limitations):

    import os
    import polars_bio as pb
    os.environ['BENCH_DATA_ROOT'] = "/Users/mwiewior/research/data/databio"
    os.environ['POLARS_MAX_THREADS'] = "1"
    os.environ['POLARS_VERBOSE'] = "1"
    
    cols=["contig", "pos_start", "pos_end"]
    BENCH_DATA_ROOT = os.getenv('BENCH_DATA_ROOT', '/data/bench_data/databio')
    df_1 = f"{BENCH_DATA_ROOT}/exons/*.parquet"
    df_2 =  f"{BENCH_DATA_ROOT}/exons/*.parquet"
    pb.overlap(df_1, df_2, cols1=cols, cols2=cols, streaming=True).explain(streaming=True)
    
    INFO:polars_bio.operation:Running in streaming mode...
    INFO:polars_bio.operation:Running Overlap operation with algorithm Coitrees and 1 thread(s)...
    'STREAMING:\n  Anonymous SCAN []\n  PROJECT */6 COLUMNS'
    

    pb.overlap(df_1, df_2, cols1=cols, cols2=cols, streaming=True).collect(streaming=True).limit()
    
    INFO:polars_bio.operation:Running in streaming mode...
    INFO:polars_bio.operation:Running Overlap operation with algorithm Coitrees and 1 thread(s)...
    RUN STREAMING PIPELINE
    [anonymous -> ordered_sink]
    shape: (5, 6)
    ┌──────────┬─────────────┬───────────┬──────────┬─────────────┬───────────┐
    │ contig_1 ┆ pos_start_1 ┆ pos_end_1 ┆ contig_2 ┆ pos_start_2 ┆ pos_end_2 │
    │ ---      ┆ ---         ┆ ---       ┆ ---      ┆ ---         ┆ ---       │
    │ str      ┆ i32         ┆ i32       ┆ str      ┆ i32         ┆ i32       │
    ╞══════════╪═════════════╪═══════════╪══════════╪═════════════╪═══════════╡
    │ chr1     ┆ 11873       ┆ 12227     ┆ chr1     ┆ 11873       ┆ 12227     │
    │ chr1     ┆ 12612       ┆ 12721     ┆ chr1     ┆ 12612       ┆ 12721     │
    │ chr1     ┆ 13220       ┆ 14409     ┆ chr1     ┆ 13220       ┆ 14409     │
    │ chr1     ┆ 14361       ┆ 14829     ┆ chr1     ┆ 13220       ┆ 14409     │
    │ chr1     ┆ 13220       ┆ 14409     ┆ chr1     ┆ 14361       ┆ 14829     │
    └──────────┴─────────────┴───────────┴──────────┴─────────────┴───────────┘
    

Limitations

  1. Single threaded.
  2. Because of a bug, only Polars sink operations, such as collect, sink_csv or sink_parquet, are supported.

DataFrames support

I/O Bioframe polars-bio PyRanges Pybedtools PyGenomics GenomicRanges
Pandas DataFrame ✅ ✅ ✅ ✅
Polars DataFrame ✅ ✅
Polars LazyFrame ✅
Native readers βœ