Quick start

Installation

From source (recommended during development)

vepyr requires a Rust toolchain and Python 3.10+.

Install uv and Rust:

curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Clone and build:

git clone git@github.com:biodatageeks/vepyr.git
cd vepyr
RUSTFLAGS="-C target-cpu=native" uv sync --reinstall-package vepyr

Verify:

uv run python -c "import vepyr; print(vepyr.__all__)"
# ['build_cache', 'annotate']
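
If the build fails early, it can help to confirm the prerequisites listed above before retrying. The following is a minimal sketch, not part of vepyr: `check_prerequisites` is a hypothetical helper, and the list of tools it looks for (`cargo`, `rustc`, `uv`) is an assumption based on the install commands on this page.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the requirements above


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of problems; an empty list means the build can proceed.

    `version_info` and `which` are injectable so the check can be exercised
    without touching the real environment.
    """
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    # Tool names are assumed from the install steps (rustup provides cargo and rustc).
    for tool in ("cargo", "rustc", "uv"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

An empty result only means the tools are discoverable on `PATH`; it does not guarantee that the Rust toolchain version is recent enough for the build.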

Building a cache

Before annotating variants, convert an Ensembl VEP offline cache to vepyr's optimized format.

Download and convert automatically

import vepyr

results = vepyr.build_cache(
    release=115,
    cache_dir="/data/vepyr_cache",
    cache_type="ensembl",
)
for path, rows in results:
    print(f"{path}: {rows:,} rows")

This downloads the Ensembl VEP release 115 cache for `homo_sapiens` / `GRCh38` and converts it to a partitioned Parquet cache.

Convert a local cache

If you already have the Ensembl VEP cache unpacked locally:

results = vepyr.build_cache(
    release=115,
    cache_dir="/data/vepyr_cache",
    cache_type="ensembl",
    local_cache="/data/ensembl_vep/homo_sapiens/115_GRCh38",
)

Options

Parameter	Default	Description
`partitions`	`8`	DataFusion partitions for parallel conversion
`species`	`homo_sapiens`	Species name
`assembly`	`GRCh38`	Genome assembly
`cache_type`	required	Ensembl VEP cache type: `ensembl`, `merged`, or `refseq`
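
The options in the table can be collected and sanity-checked before starting a long conversion. This is a sketch under assumptions: `cache_build_kwargs` is a hypothetical helper, and it assumes that `species`, `assembly`, and `partitions` are accepted by `build_cache` as keyword arguments with the defaults listed above. Whether vepyr performs the same validation itself is not stated here.

```python
# Allowed values taken from the `cache_type` row of the options table.
VALID_CACHE_TYPES = {"ensembl", "merged", "refseq"}


def cache_build_kwargs(
    release,
    cache_dir,
    cache_type,
    *,
    species="homo_sapiens",
    assembly="GRCh38",
    partitions=8,
):
    """Validate the options and return them as a keyword-argument dict."""
    if cache_type not in VALID_CACHE_TYPES:
        raise ValueError(
            f"cache_type must be one of {sorted(VALID_CACHE_TYPES)}, got {cache_type!r}"
        )
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    return {
        "release": release,
        "cache_dir": cache_dir,
        "cache_type": cache_type,
        "species": species,
        "assembly": assembly,
        "partitions": partitions,
    }
```

The resulting dict could then be passed along as `vepyr.build_cache(**cache_build_kwargs(115, "/data/vepyr_cache", "merged", partitions=16))`, so a typo in `cache_type` fails immediately rather than partway through a download.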

Annotating variants

Basic annotation

import vepyr

lf = vepyr.annotate(
    vcf="input.vcf.gz",
    cache_dir="/data/vepyr_cache/parquet/115_GRCh38_ensembl",
    check_existing=True,
    af=True,
    max_af=True,
)

df = lf.collect()
print(df.select("chrom", "start", "ref", "alt", "most_severe_consequence").head())
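
Note that the `cache_dir` passed to `annotate` points at the converted cache, not at the directory given to `build_cache`. The examples on this page suggest a layout of `<cache_dir>/parquet/<release>_<assembly>_<cache_type>`. The sketch below encodes that pattern; `converted_cache_path` is a hypothetical helper and the layout is inferred only from these example paths, so check it against the paths that `build_cache` reports.

```python
from pathlib import PurePosixPath


def converted_cache_path(cache_root, release, cache_type, assembly="GRCh38"):
    """Build the converted-cache path following the pattern seen in the examples.

    The layout is an assumption inferred from the example paths on this page.
    """
    return str(
        PurePosixPath(cache_root) / "parquet" / f"{release}_{assembly}_{cache_type}"
    )
```

For the values used above, `converted_cache_path("/data/vepyr_cache", 115, "ensembl")` gives `/data/vepyr_cache/parquet/115_GRCh38_ensembl`, which matches the `cache_dir` in the annotation examples.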

Full `--everything` mode

Enable all annotation features (80-field CSQ). Requires a reference FASTA:

lf = vepyr.annotate(
    vcf="input.vcf.gz",
    cache_dir="/data/vepyr_cache/parquet/115_GRCh38_ensembl",
    everything=True,
    reference_fasta="GRCh38.fa",
)

df = lf.collect()
print(f"{df.height} variants x {df.width} columns")

The `workers` parameter controls how many within-contig annotation pipelines run concurrently. `workers=1` is the serial path; `workers > 1` requires a tabix-indexed (bgzip + `.tbi`) input VCF.

df = vepyr.annotate(
    "input.vcf.gz",
    "/data/vepyr_cache/parquet/115_GRCh38_ensembl",
    workers=4,
).collect()
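
Because `workers > 1` depends on a tabix index, a script can fall back to the serial path when no index is present. This is a minimal sketch: `safe_worker_count` is a hypothetical helper, and it assumes the index sits next to the VCF as `<vcf>.tbi`, which is where `tabix` writes it by default. It does not verify that the VCF is actually bgzip-compressed.

```python
import os


def safe_worker_count(vcf_path, requested, exists=os.path.isfile):
    """Return `requested` if a .tbi index sits beside the VCF, otherwise 1.

    `exists` is injectable so the logic can be checked without real files.
    """
    if requested < 1:
        raise ValueError("workers must be at least 1")
    if requested == 1:
        return 1  # the serial path needs no index
    index_path = str(vcf_path) + ".tbi"  # assumed default tabix index location
    return requested if exists(index_path) else 1
```

The result can be passed as `workers=safe_worker_count("input.vcf.gz", 4)`, so the same script works for both indexed and unindexed inputs.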

Writing annotated VCF output

Write results directly to a VCF file instead of returning a LazyFrame:

out_path = vepyr.annotate(
    vcf="input.vcf.gz",
    cache_dir="/data/vepyr_cache/parquet/115_GRCh38_ensembl",
    everything=True,
    reference_fasta="GRCh38.fa",
    output_vcf="annotated.vcf.gz",  # .vcf.gz for bgzf, .vcf for plain
)
print(f"Wrote annotated VCF to {out_path}")
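
A quick way to sanity-check the written file is to count its records with the standard library alone. The sketch below is not part of vepyr; `open_vcf`, `count_records`, and `count_vcf_records` are hypothetical helpers. It relies on BGZF being a series of concatenated gzip members, which Python's `gzip` module reads transparently.

```python
import gzip


def open_vcf(path):
    """Open a VCF as text, using gzip for .gz paths (BGZF is gzip-compatible)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_records(lines):
    """Count data lines, skipping blank lines and '#'-prefixed header lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def count_vcf_records(path):
    """Count the variant records in a plain or compressed VCF file."""
    with open_vcf(path) as handle:
        return count_records(handle)
```

Comparing `count_vcf_records("input.vcf.gz")` with `count_vcf_records("annotated.vcf.gz")` is one way to spot a truncated output, assuming annotation writes one output record per input record.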