Quick start
Installation
From source (recommended during development)
vepyr requires a Rust toolchain and Python 3.10+.
- Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Install the Rust toolchain via rustup:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
- Clone and build:
git clone git@github.com:biodatageeks/vepyr.git
cd vepyr
RUSTFLAGS="-C target-cpu=native" uv sync --reinstall-package vepyr
- Verify:
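One way to sanity-check the install is sketched below. It assumes only that the package is importable under the name `vepyr`, as in the examples in this guide, and that Python 3.10+ is required as stated above. The helper name `check_environment` is hypothetical and not part of vepyr.

```python
import importlib.util
import sys


def check_environment(package="vepyr", min_python=(3, 10)):
    """Report whether the interpreter version and the package look usable.

    Hypothetical helper: it only checks that the package can be located,
    it does not import it or call any vepyr function.
    """
    return {
        "python_ok": sys.version_info >= min_python,
        "package_found": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    print(check_environment())
```

If `package_found` is false after the build step, the most likely cause is that the command was run outside the environment that `uv sync` created.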
Building a cache
Before annotating variants, you need to convert an Ensembl VEP offline cache to vepyr's optimized format.
Download and convert automatically
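import vepyr

# Download the Ensembl VEP cache for the given release and convert it.
results = vepyr.build_cache(
    release=115,
    cache_dir="/data/vepyr_cache",
)

# Each result pairs an output path with the number of rows written.
for path, rows in results:
    print(f"{path}: {rows:,} rows")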
This downloads the Ensembl VEP 115 cache for homo_sapiens / GRCh38, converts it to Parquet, and builds fjall KV stores.
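The loop above suggests that `build_cache` yields `(path, rows)` pairs. Under that assumption, a small summary step can flag outputs that ended up empty. This is a minimal sketch: `summarize_build` is a hypothetical helper, and the file names in the example are made-up placeholders rather than real vepyr output.

```python
from typing import Iterable, Tuple


def summarize_build(results: Iterable[Tuple[str, int]]) -> dict:
    """Aggregate (path, rows) pairs into simple totals."""
    pairs = [(str(path), int(rows)) for path, rows in results]
    return {
        "files": len(pairs),
        "total_rows": sum(rows for _, rows in pairs),
        # Outputs with zero rows may deserve a second look.
        "empty_files": [path for path, rows in pairs if rows == 0],
    }


# Made-up placeholder values, only to show the shape of the input.
example = [("chr1.parquet", 1200), ("chr2.parquet", 0)]
print(summarize_build(example))
```

In real use, the list passed in would be the `results` returned by `vepyr.build_cache(...)`.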
Convert a local cache
If you already have the Ensembl VEP cache unpacked locally:
results = vepyr.build_cache(
    release=115,
    cache_dir="/data/vepyr_cache",
    local_cache="/data/ensembl_vep/homo_sapiens/115_GRCh38",
)
Options
| Parameter | Default | Description |
|---|---|---|
| `partitions` | `1` | DataFusion partitions for parallel conversion |
| `build_fjall` | `True` | Build fjall KV stores alongside Parquet |
| `fjall_zstd_level` | `3` | Zstd compression level (1-22) |
| `species` | `homo_sapiens` | Species name |
| `assembly` | `GRCh38` | Genome assembly |
| `cache_type` | `vep` | Cache type: `vep`, `merged`, or `refseq` |
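The options can be collected and checked before a long-running build starts. The sketch below assumes they are accepted as keyword arguments of `vepyr.build_cache`, which the table implies but does not state. `make_build_options` is a hypothetical helper, and the `partitions >= 1` check is an assumption. The other two checks mirror the ranges listed in the table; whether vepyr performs its own validation is not known here.

```python
VALID_CACHE_TYPES = {"vep", "merged", "refseq"}


def make_build_options(
    partitions=1,
    build_fjall=True,
    fjall_zstd_level=3,
    species="homo_sapiens",
    assembly="GRCh38",
    cache_type="vep",
):
    """Return a validated keyword-argument dict using the documented defaults."""
    if partitions < 1:  # assumption: at least one partition is needed
        raise ValueError("partitions must be >= 1")
    if not 1 <= fjall_zstd_level <= 22:
        raise ValueError("fjall_zstd_level must be between 1 and 22")
    if cache_type not in VALID_CACHE_TYPES:
        raise ValueError(f"cache_type must be one of {sorted(VALID_CACHE_TYPES)}")
    return {
        "partitions": partitions,
        "build_fjall": build_fjall,
        "fjall_zstd_level": fjall_zstd_level,
        "species": species,
        "assembly": assembly,
        "cache_type": cache_type,
    }


options = make_build_options(partitions=8, cache_type="merged")

# Intended use (not executed here, since it requires vepyr and a download):
# results = vepyr.build_cache(release=115, cache_dir="/data/vepyr_cache", **options)
```

Failing fast on a mistyped `cache_type` is cheaper than discovering it partway through a conversion.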
Annotating variants
Basic annotation
import vepyr
lf = vepyr.annotate(
    vcf="input.vcf.gz",
    cache_dir="/data/vepyr_cache/parquet/115_GRCh38_vep",
    check_existing=True,
    af=True,
    max_af=True,
)
df = lf.collect()
print(df.select("chrom", "start", "ref", "alt", "most_severe_consequence").head())
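Note that `annotate` is pointed at a subdirectory of the `cache_dir` passed to `build_cache`. Judging only from the two examples in this guide, the layout appears to be `<cache_dir>/parquet/<release>_<assembly>_<cache_type>`. The hypothetical helper below encodes that inferred pattern. Treat it as an assumption and check the directory that `build_cache` actually produced.

```python
from pathlib import PurePosixPath


def parquet_cache_path(cache_dir, release, assembly="GRCh38", cache_type="vep"):
    """Build the Parquet cache path following the pattern inferred from the examples."""
    subdir = f"{release}_{assembly}_{cache_type}"
    return str(PurePosixPath(cache_dir) / "parquet" / subdir)


print(parquet_cache_path("/data/vepyr_cache", 115))
# prints /data/vepyr_cache/parquet/115_GRCh38_vep
```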
Full --everything mode
Enable all annotation features (80-field CSQ). This mode requires a reference FASTA:
lf = vepyr.annotate(
    vcf="input.vcf.gz",
    cache_dir="/data/vepyr_cache/parquet/115_GRCh38_vep",
    everything=True,
    reference_fasta="GRCh38.fa",
)
df = lf.collect()
print(f"{df.height} variants x {df.width} columns")
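Because this mode depends on a reference FASTA, it can help to confirm the file is present before starting a long annotation run. The sketch below is a generic pre-flight check. `require_fasta` is a hypothetical helper, not a vepyr function, and it makes no claim about how vepyr itself reports a missing file.

```python
from pathlib import Path


def require_fasta(path):
    """Return the path as a string if it points at a non-empty regular file."""
    fasta = Path(path)
    if not fasta.is_file():
        raise FileNotFoundError(f"reference FASTA not found: {fasta}")
    if fasta.stat().st_size == 0:
        raise ValueError(f"reference FASTA is empty: {fasta}")
    return str(fasta)


# Intended use (not executed here, since it requires vepyr and real inputs):
# lf = vepyr.annotate(
#     vcf="input.vcf.gz",
#     cache_dir="/data/vepyr_cache/parquet/115_GRCh38_vep",
#     everything=True,
#     reference_fasta=require_fasta("GRCh38.fa"),
# )
```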
Using the fjall backend
Pass use_fjall=True for faster co-located variant lookups on large caches:
lf = vepyr.annotate(
    vcf="input.vcf.gz",
    cache_dir="/data/vepyr_cache/parquet/115_GRCh38_vep",
    check_existing=True,
    af=True,
    max_af=True,
    use_fjall=True,
)
Writing annotated VCF output
Write results directly to a VCF file instead of returning a LazyFrame: