Features
Genomic ranges operations
Features | Bioframe | polars-bio | PyRanges | Pybedtools | PyGenomics | GenomicRanges |
---|---|---|---|---|---|---|
overlap | ||||||
nearest | ||||||
count_overlaps | ||||||
cluster | ||||||
merge | ||||||
complement | ||||||
coverage | ||||||
expand | ||||||
sort | ||||||
read_table |
API comparison between libraries
There is no standard API for genomic ranges operations in Python. This table compares the API of the libraries. The table is not exhaustive and only shows the most common operations used in benchmarking.
operation | Bioframe | polars-bio | PyRanges0 | PyRanges1 | Pybedtools | GenomicRanges |
---|---|---|---|---|---|---|
overlap | overlap | overlap | join [1] | join_ranges | intersect [2] | find_overlaps [3] |
nearest | closest | nearest | nearest | nearest | closest [4] | nearest [5] |
read_table | read_table | read_table | read_bed | read_bed | BedTool | read_bed |
Note
1. PyRanges also has an overlap method, but its output is limited to the indices of intervals from the other DataFrame that overlap. Bioframe's benchmark likewise used join instead of overlap.
2. The wa and wb options were used to obtain comparable output.
3. The output is a list of the same length as the query, containing the indices of overlapping hits. Additional data transformation is required to obtain the same output as in other libraries. Because performance was already far worse than in the more efficient libraries, this transformation was not included in the benchmark.
4. s=first was used to obtain comparable output.
5. select="arbitrary" was used to obtain comparable output.
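The note on GenomicRanges mentions that a per-query list of hit indices has to be transformed before it is comparable with join-style output. The sketch below illustrates that kind of transformation in plain Python. It does not use any of the libraries in the table: the helper names `hits_per_query` and `to_pairs` are hypothetical, the overlap search is a naive quadratic loop, and half-open coordinates are assumed.

```python
from typing import List, Tuple

# (chrom, start, end); half-open coordinates are an assumption of this sketch.
Interval = Tuple[str, int, int]


def hits_per_query(queries: List[Interval], subjects: List[Interval]) -> List[List[int]]:
    """Naive O(n*m) overlap search: one list of subject indices per query."""
    result = []
    for q_chrom, q_start, q_end in queries:
        hits = [
            j
            for j, (s_chrom, s_start, s_end) in enumerate(subjects)
            if s_chrom == q_chrom and s_start < q_end and q_start < s_end
        ]
        result.append(hits)
    return result


def to_pairs(hits: List[List[int]]) -> List[Tuple[int, int]]:
    """Flatten per-query hit lists into (query_index, subject_index) pairs."""
    return [(i, j) for i, subject_ids in enumerate(hits) for j in subject_ids]


queries = [("chr1", 11873, 12227), ("chr1", 13220, 14409), ("chr2", 100, 200)]
subjects = [("chr1", 11873, 12227), ("chr1", 14361, 14829)]

hits = hits_per_query(queries, subjects)  # same length as queries
pairs = to_pairs(hits)                    # one row per overlapping pair
```

The pairs can then be used to look up both intervals and build one row per overlap, which is the shape the join-style methods return directly.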
File formats support
Format | Support level |
---|---|
BED | |
VCF | |
BAM | |
FASTQ | |
FASTA | |
GFF | |
GTF | |
Indexed VCF | |
Indexed BAM |
SQL-powered data processing
polars-bio provides a SQL-like API for querying and manipulating bioinformatic data. Check the SQL reference for more details.
import polars_bio as pb
pb.register_vcf("gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_sv", thread_num=1, info_fields=["SVTYPE", "SVLEN"])
pb.sql("SELECT * FROM gnomad_sv WHERE SVTYPE = 'DEL' AND SVLEN > 1000").limit(3).collect()
shape: (3, 10)
┌───────┬───────┬───────┬────────────────────────────────┬───┬───────┬────────────┬────────┬───────┐
│ chrom ┆ start ┆ end   ┆ id                             ┆ … ┆ qual  ┆ filter     ┆ svtype ┆ svlen │
│ ---   ┆ ---   ┆ ---   ┆ ---                            ┆   ┆ ---   ┆ ---        ┆ ---    ┆ ---   │
│ str   ┆ u32   ┆ u32   ┆ str                            ┆   ┆ f64   ┆ str        ┆ str    ┆ i32   │
╞═══════╪═══════╪═══════╪════════════════════════════════╪═══╪═══════╪════════════╪════════╪═══════╡
│ chr1  ┆ 22000 ┆ 30000 ┆ gnomAD-SV_v3_DEL_chr1_fa103016 ┆ … ┆ 999.0 ┆ HIGH_NCR   ┆ DEL    ┆ 8000  │
│ chr1  ┆ 40000 ┆ 47000 ┆ gnomAD-SV_v3_DEL_chr1_b26f63f7 ┆ … ┆ 145.0 ┆ PASS       ┆ DEL    ┆ 7000  │
│ chr1  ┆ 79086 ┆ 88118 ┆ gnomAD-SV_v3_DEL_chr1_733c4ef0 ┆ … ┆ 344.0 ┆ UNRESOLVED ┆ DEL    ┆ 9032  │
└───────┴───────┴───────┴────────────────────────────────┴───┴───────┴────────────┴────────┴───────┘
Parallel engine and streaming processing
It is straightforward to parallelize operations in polars-bio. Because the library is built on top of Apache DataFusion, you can set the degree of parallelism with the datafusion.execution.target_partitions option.
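As a minimal sketch of setting this option: the option name comes from the text above, while the `partitions_option` helper is hypothetical and `pb.set_option(key, value)` with a string value is an assumption about the installed polars-bio version, so check the API reference before relying on it. The import is guarded so the snippet also runs where polars-bio is not installed.

```python
import os

# Option name taken from the text above.
OPTION_KEY = "datafusion.execution.target_partitions"


def partitions_option(n_partitions=None):
    """Return the (key, value) pair for the requested degree of parallelism.

    Falls back to the number of logical CPUs when n_partitions is None.
    """
    if n_partitions is None:
        n_partitions = os.cpu_count() or 1
    if n_partitions < 1:
        raise ValueError("n_partitions must be >= 1")
    # The value is passed as a string here (an assumption about the expected type).
    return OPTION_KEY, str(n_partitions)


key, value = partitions_option(8)

try:
    import polars_bio as pb

    # Assumption: pb.set_option(key, value) is available in the installed version.
    pb.set_option(key, value)
except (ImportError, AttributeError):
    # polars-bio is missing or exposes a different API; nothing is configured.
    pass
```

Because the setting is global, it is usually enough to apply it once at the start of a session, before running any range operations.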
Tip
- The default value is 1 (parallel execution disabled).
- The datafusion.execution.target_partitions option is a global setting and affects all operations in the current session.
- Check available strategies for optimal performance.
- See the other configuration settings in the Apache DataFusion documentation.
Streaming (out-of-core processing) [Experimental]
polars-bio supports out-of-core processing through the Polars LazyFrame streaming option. It can bring a significant speedup as well as a reduction in memory usage, making it possible to process large datasets that do not fit in memory. See our benchmark results.
import os
import polars_bio as pb
os.environ['BENCH_DATA_ROOT'] = "/Users/mwiewior/research/data/databio"
os.environ['POLARS_MAX_THREADS'] = "1"
os.environ['POLARS_VERBOSE'] = "1"
cols=["contig", "pos_start", "pos_end"]
BENCH_DATA_ROOT = os.getenv('BENCH_DATA_ROOT', '/data/bench_data/databio')
df_1 = f"{BENCH_DATA_ROOT}/exons/*.parquet"
df_2 = f"{BENCH_DATA_ROOT}/exons/*.parquet"
pb.overlap(df_1, df_2, cols1=cols, cols2=cols, streaming=True).collect(streaming=True).limit()
INFO:polars_bio.operation:Running in streaming mode...
INFO:polars_bio.operation:Running Overlap operation with algorithm Coitrees and 1 thread(s)...
'STREAMING:\n Anonymous SCAN []\n PROJECT */6 COLUMNS'
INFO:polars_bio.operation:Running in streaming mode...
INFO:polars_bio.operation:Running Overlap operation with algorithm Coitrees and 1 thread(s)...
RUN STREAMING PIPELINE
[anonymous -> ordered_sink]
shape: (5, 6)
┌──────────┬─────────────┬───────────┬──────────┬─────────────┬───────────┐
│ contig_1 ┆ pos_start_1 ┆ pos_end_1 ┆ contig_2 ┆ pos_start_2 ┆ pos_end_2 │
│ ---      ┆ ---         ┆ ---       ┆ ---      ┆ ---         ┆ ---       │
│ str      ┆ i32         ┆ i32       ┆ str      ┆ i32         ┆ i32       │
╞══════════╪═════════════╪═══════════╪══════════╪═════════════╪═══════════╡
│ chr1     ┆ 11873       ┆ 12227     ┆ chr1     ┆ 11873       ┆ 12227     │
│ chr1     ┆ 12612       ┆ 12721     ┆ chr1     ┆ 12612       ┆ 12721     │
│ chr1     ┆ 13220       ┆ 14409     ┆ chr1     ┆ 13220       ┆ 14409     │
│ chr1     ┆ 14361       ┆ 14829     ┆ chr1     ┆ 13220       ┆ 14409     │
│ chr1     ┆ 13220       ┆ 14409     ┆ chr1     ┆ 14361       ┆ 14829     │
└──────────┴─────────────┴───────────┴──────────┴─────────────┴───────────┘
Limitations
- Single-threaded.
- Because of a bug, only Polars sink operations, such as collect, sink_csv, or sink_parquet, are supported.
DataFrames support
I/O | Bioframe | polars-bio | PyRanges | Pybedtools | PyGenomics | GenomicRanges |
---|---|---|---|---|---|---|
Pandas DataFrame | ||||||
Polars DataFrame | ||||||
Polars LazyFrame | ||||||
Native readers |