
API reference

vepyr.build_cache(release, cache_dir, *, species='homo_sapiens', assembly='GRCh38', cache_type='vep', partitions=1, build_fjall=True, fjall_zstd_level=3, fjall_dict_size_kb=112, local_cache=None, download_retries=10, show_progress=True, on_progress=None)

Download an Ensembl VEP cache and convert it to optimized Parquet files and, when build_fjall is enabled, fjall key-value stores.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| release | int | Ensembl release number (e.g. 115). | required |
| cache_dir | str | Root directory for cache data and Parquet output. | required |
| species | str | Species name (default: "homo_sapiens"). | 'homo_sapiens' |
| assembly | str | Genome assembly (default: "GRCh38"). | 'GRCh38' |
| cache_type | str | Cache type: "vep" (default), "merged", or "refseq". | 'vep' |
| partitions | int | Number of DataFusion partitions for parallelism (default: 1). | 1 |
| build_fjall | bool | Build fjall KV stores for variation and sift lookups (default: True). | True |
| fjall_zstd_level | int | Zstd compression level for fjall stores (default: 3). | 3 |
| fjall_dict_size_kb | int | Zstd dictionary size in KB for fjall stores (default: 112). | 112 |
| local_cache | str or None | Path to an already-unpacked Ensembl VEP cache directory (the one containing info.txt). When provided, downloading and extraction are skipped entirely. | None |
| download_retries | int | Maximum number of resume-retries for the cache download (default: 10). Each retry resumes from the last byte received. | 10 |
| show_progress | bool | Show tqdm progress bars during conversion (default: True). | True |
| on_progress | callable or None | Custom progress callback with signature (entity, format, batch_rows, total_rows, total_expected). Overrides the default tqdm bars when provided. | None |

Returns:

| Type | Description |
| --- | --- |
| list[tuple[str, int]] | List of (parquet_file_path, row_count) for each written file. |
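
A minimal sketch of driving build_cache with a custom on_progress callback and totalling its return value. The helper names (make_progress_logger, summarize, run_build) are hypothetical; only the callback signature and the list of (path, row_count) tuples come from the reference above. The argument types of the callback are not documented here, so the sketch assumes the counts are integers and that total_expected may be None or 0 when the total is unknown.

```python
from typing import Callable, List, Tuple


def make_progress_logger(log: List[str]) -> Callable:
    """Return a callback matching the documented on_progress signature."""

    def on_progress(entity, format, batch_rows, total_rows, total_expected):
        # Assumption: total_expected may be None or 0 when the total is unknown.
        if total_expected:
            pct = f"{100.0 * total_rows / total_expected:.1f}%"
        else:
            pct = "n/a"
        log.append(f"{entity} [{format}] +{batch_rows} rows, {total_rows} total ({pct})")

    return on_progress


def summarize(written: List[Tuple[str, int]]) -> Tuple[int, int]:
    """Return (file_count, total_rows) from build_cache's return value."""
    return len(written), sum(rows for _, rows in written)


def run_build(local_cache: str, cache_dir: str):
    """Hypothetical wrapper; vepyr is imported lazily so the helpers work without it."""
    import vepyr

    log: List[str] = []
    written = vepyr.build_cache(
        115,
        cache_dir,
        local_cache=local_cache,  # reuse an unpacked cache, skipping the download
        on_progress=make_progress_logger(log),
    )
    return summarize(written), log
```

Collecting messages in a list rather than printing keeps the callback cheap and makes it easy to inspect afterwards; a logger or a notebook widget could be swapped in at the same point.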

vepyr.annotate(vcf, cache_dir, *, everything=False, hgvs=False, hgvsc=False, hgvsp=False, shift_hgvs=None, no_escape=False, remove_hgvsp_version=False, hgvsp_use_prediction=False, reference_fasta=None, check_existing=False, af=False, af_1kg=False, af_gnomade=False, af_gnomadg=False, max_af=False, pubmed=False, use_fjall=False, extended_probes=True, distance=None, merged=False, failed=0, cache_size_mb=1024, skip_csq=True, output_vcf=None, show_progress=True, compression=None, on_batch_written=None)

Annotate variants from a VCF file with VEP consequences.

Reads the VCF, runs annotate_vep() against the partitioned Parquet cache produced by build_cache(), and returns a polars LazyFrame (or the output file path when output_vcf is set).

The engine auto-discovers context tables (transcript, exon, translation, regulatory, motif) from cache_dir subdirectories.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| vcf | str | Path to the input VCF file. | required |
| cache_dir | str | Path to the Parquet cache directory produced by build_cache(), e.g. "/data/vep/wgs/parquet/115_GRCh38_vep". | required |
| everything | bool | Enable all annotation features (80-field CSQ). Implies hgvs, af, check_existing, pubmed, etc. Requires reference_fasta. | False |
| hgvs | bool | Add HGVS notation. Implies hgvsc, hgvsp, shift_hgvs. Requires reference_fasta. | False |
| hgvsc | bool | Enable HGVSc notation (implied by hgvs/everything). | False |
| hgvsp | bool | Enable HGVSp notation (implied by hgvs/everything). | False |
| shift_hgvs | bool or None | Shift HGVS notation toward the 3' end. None selects automatically (True when hgvs is enabled). | None |
| no_escape | bool | Don't URI-escape HGVS strings. | False |
| remove_hgvsp_version | bool | Remove version from HGVSp transcript ID. | False |
| hgvsp_use_prediction | bool | Use predicted rather than observed protein sequence. | False |
| reference_fasta | str or None | Path to reference FASTA (required for HGVS/everything). | None |
| check_existing | bool | Check for co-located known variants (implied by AF flags). | False |
| af | bool | Include allele frequencies. | False |
| af_1kg | bool | Include 1000 Genomes allele frequencies. | False |
| af_gnomade | bool | Include gnomAD exome allele frequencies. | False |
| af_gnomadg | bool | Include gnomAD genome allele frequencies. | False |
| max_af | bool | Include maximum AF across populations. | False |
| pubmed | bool | Include PubMed IDs for co-located variants. | False |
| extended_probes | bool | Use interval-overlap fallback for shifted indels (default: True). | True |
| distance | int or tuple[int, int] or None | Upstream/downstream distance for transcript overlap. Single int = both directions; tuple = (upstream, downstream). | None |
| merged | bool | Use merged Ensembl+RefSeq cache. | False |
| failed | int | Maximum allowed failed flag value from cache (default: 0). | 0 |
| use_fjall | bool | Use fjall (embedded KV store) backend instead of Parquet (default: False). | False |
| cache_size_mb | int | Annotation cache size in MB (default: 1024). | 1024 |
| skip_csq | bool | Exclude the raw CSQ column from the output (default: True). When True, only the parsed annotation columns are returned. | True |
| output_vcf | str or None | Path to write annotated VCF output. When set, annotation results are written directly to a VCF file and the output path is returned. When None (default), returns a polars LazyFrame. Compression is auto-detected from the file extension: .vcf for plain text, .vcf.gz or .vcf.bgz for block-gzipped (bgzf). Override with the compression parameter. | None |
| show_progress | bool | Show a progress bar on stderr during VCF output (default: True). Only used when output_vcf is set. | True |
| compression | str or None | VCF output compression. "bgzf" (block-gzip, tabix-compatible), "gzip", "plain", or None (auto-detect from extension). Only used when output_vcf is set. | None |
| on_batch_written | callable or None | Callback invoked after each batch is written to VCF, with signature (batch_rows: int, total_rows: int, total_input: int). total_rows is the cumulative number of VCF records written so far. total_input is the total number of input variants when known. Useful for driving tqdm progress bars in notebooks. Only used when output_vcf is set. | None |
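
Two of the argument rules in the table above (the int-or-tuple form of distance, and the extension-based choice of VCF compression) can be written down as small helpers. This is only an illustration of the documented behaviour, not vepyr's internal code: normalize_distance and infer_compression are hypothetical names, and raising ValueError for an unrecognized extension is an assumption, since the reference does not say what happens in that case.

```python
from typing import Optional, Tuple, Union


def normalize_distance(
    distance: Union[int, Tuple[int, int], None],
) -> Optional[Tuple[int, int]]:
    """Expand the documented forms of `distance` to (upstream, downstream)."""
    if distance is None:
        return None  # leave the library default in effect
    if isinstance(distance, int):
        return (distance, distance)  # a single int applies to both directions
    upstream, downstream = distance
    return (int(upstream), int(downstream))


def infer_compression(output_vcf: str, compression: Optional[str] = None) -> str:
    """Mirror the documented extension rule for VCF output compression."""
    if compression is not None:
        return compression  # an explicit value overrides auto-detection
    if output_vcf.endswith((".vcf.gz", ".vcf.bgz")):
        return "bgzf"
    if output_vcf.endswith(".vcf"):
        return "plain"
    # Assumption: the reference does not document this case.
    raise ValueError(f"cannot infer compression from {output_vcf!r}")
```

Note that, per the table, a .vcf.gz path maps to bgzf rather than plain gzip; plain gzip has to be requested explicitly through compression="gzip".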

Returns:

| Type | Description |
| --- | --- |
| LazyFrame or str | When output_vcf is None: annotated variants as a polars LazyFrame with typed annotation columns plus original VCF fields. When output_vcf is set: the output VCF file path. |
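
When output_vcf is set, the on_batch_written callback described above can drive a custom progress display. The sketch below uses a plain stderr counter instead of tqdm to stay dependency-free. BatchProgress and annotate_to_vcf are hypothetical names, and the handling of an unknown total_input (treated as None or 0) is an assumption, since the reference only says the value is provided "when known".

```python
import sys


class BatchProgress:
    """Callable matching the documented on_batch_written signature."""

    def __init__(self, stream=sys.stderr):
        self.stream = stream
        self.batches = 0
        self.written = 0

    def __call__(self, batch_rows, total_rows, total_input):
        self.batches += 1
        self.written = total_rows  # cumulative count reported by the writer
        # Assumption: total_input is None or 0 when the input size is unknown.
        total = total_input if total_input else "?"
        self.stream.write(f"\rwrote {total_rows}/{total} records")
        self.stream.flush()


def annotate_to_vcf(vcf: str, cache_dir: str, output_vcf: str) -> str:
    """Hypothetical wrapper; vepyr is imported lazily."""
    import vepyr

    progress = BatchProgress()
    # With output_vcf set, annotate() returns the output path, not a LazyFrame.
    return vepyr.annotate(
        vcf,
        cache_dir,
        output_vcf=output_vcf,
        show_progress=False,  # assumption: avoids two progress displays at once
        on_batch_written=progress,
    )
```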

Examples:

>>> import vepyr
>>> lf = vepyr.annotate("input.vcf", "/data/vep/parquet/115_GRCh38_vep")
>>> lf.collect()
>>> # Full annotation with all features
>>> lf = vepyr.annotate(
...     "input.vcf",
...     "/data/vep/parquet/115_GRCh38_vep",
...     everything=True,
...     reference_fasta="/ref/GRCh38.fa",
... )
>>> # Selective: HGVS + allele frequencies
>>> lf = vepyr.annotate(
...     "input.vcf",
...     "/data/vep/parquet/115_GRCh38_vep",
...     hgvs=True,
...     af=True,
...     af_gnomadg=True,
...     reference_fasta="/ref/GRCh38.fa",
... )
>>> # Write annotated VCF directly
>>> path = vepyr.annotate(
...     "input.vcf",
...     "/data/vep/parquet/115_GRCh38_vep",
...     everything=True,
...     reference_fasta="/ref/GRCh38.fa",
...     output_vcf="annotated.vcf",
... )
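
The parameter table states that everything and hgvs both require reference_fasta. A caller who wants to fail fast before starting a long run could check this up front; the sketch below is one hypothetical way to do so. check_annotate_args is not part of vepyr, and whether vepyr performs an equivalent check itself is not stated in this reference.

```python
import os
from typing import Optional


def check_annotate_args(
    *,
    everything: bool = False,
    hgvs: bool = False,
    reference_fasta: Optional[str] = None,
) -> None:
    """Raise early when an option that needs reference_fasta is set without one."""
    needs_fasta = [
        name for name, enabled in (("everything", everything), ("hgvs", hgvs)) if enabled
    ]
    if not needs_fasta:
        return  # nothing requested that depends on the FASTA
    if reference_fasta is None:
        raise ValueError(f"{' and '.join(needs_fasta)} requires reference_fasta")
    if not os.path.isfile(reference_fasta):
        raise FileNotFoundError(reference_fasta)
```

Calling check_annotate_args(everything=True) without a FASTA path raises ValueError, while the same call with an existing file passes silently.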