API reference
vepyr.build_cache(release, cache_dir, *, species='homo_sapiens', assembly='GRCh38', cache_type='vep', partitions=1, build_fjall=True, fjall_zstd_level=3, fjall_dict_size_kb=112, local_cache=None, download_retries=10, show_progress=True, on_progress=None)
Download an Ensembl VEP cache and convert it to optimized Parquet + fjall.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `release` | `int` | Ensembl release number (e.g. 115). | required |
| `cache_dir` | `str` | Root directory for cache data and Parquet output. | required |
| `species` | `str` | Species name. | `'homo_sapiens'` |
| `assembly` | `str` | Genome assembly. | `'GRCh38'` |
| `cache_type` | `str` | Cache type: … | `'vep'` |
| `partitions` | `int` | Number of DataFusion partitions for parallelism. | `1` |
| `build_fjall` | `bool` | Build fjall KV stores for variation and sift lookups. | `True` |
| `fjall_zstd_level` | `int` | Zstd compression level for fjall stores. | `3` |
| `fjall_dict_size_kb` | `int` | Zstd dictionary size in KB for fjall stores. | `112` |
| `local_cache` | `str` or `None` | Path to an already-unpacked Ensembl VEP cache directory (the one containing …). | `None` |
| `download_retries` | `int` | Maximum number of resume-retries for the cache download. Each retry resumes from the last byte received. | `10` |
| `show_progress` | `bool` | Show tqdm progress bars during conversion. | `True` |
| `on_progress` | callable or `None` | Custom progress callback with signature … | `None` |
Returns:
| Type | Description |
|---|---|
| `list[tuple[str, int]]` | List of … |
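As a rough illustration of consuming the return value, the sketch below wraps any `build_cache`-style callable and prints the `(str, int)` pairs it returns. `build_and_report` is a hypothetical helper, not part of vepyr. Because the description of the pairs is cut off above, it assumes only the documented `list[tuple[str, int]]` shape and does not interpret what the strings or integers mean.

```python
from typing import Callable, List, Tuple

BuildResult = List[Tuple[str, int]]


def build_and_report(build_cache: Callable[..., BuildResult],
                     release: int, cache_dir: str, **options) -> int:
    """Call a build_cache-style function and print each (str, int) pair.

    Hypothetical helper: the callable is injected so the sketch runs
    without vepyr installed. Returns the sum of the integer components.
    """
    results = build_cache(release, cache_dir, **options)
    total = 0
    for label, value in results:
        print(f"{label}: {value}")
        total += value
    return total


# With vepyr installed, the call would look like this
# (the cache directory is an illustrative path):
#   import vepyr
#   build_and_report(vepyr.build_cache, 115, "/data/vep", partitions=4)
```

Passing the function in as an argument keeps the reporting logic testable with a stub, which may be useful given that a real build downloads a full cache.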
vepyr.annotate(vcf, cache_dir, *, everything=False, hgvs=False, hgvsc=False, hgvsp=False, shift_hgvs=None, no_escape=False, remove_hgvsp_version=False, hgvsp_use_prediction=False, reference_fasta=None, check_existing=False, af=False, af_1kg=False, af_gnomade=False, af_gnomadg=False, max_af=False, pubmed=False, use_fjall=False, extended_probes=True, distance=None, merged=False, failed=0, cache_size_mb=1024, skip_csq=True, output_vcf=None, show_progress=True, compression=None, on_batch_written=None)
Annotate variants from a VCF file with VEP consequences.
Reads the VCF, runs `annotate_vep()` against the partitioned Parquet
cache produced by `build_cache`, and returns a polars LazyFrame.
The engine auto-discovers context tables (transcript, exon, translation,
regulatory, motif) from `cache_dir` subdirectories.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `vcf` | `str` | Path to the input VCF file. | required |
| `cache_dir` | `str` | Path to the Parquet cache directory produced by `build_cache`. | required |
| `everything` | `bool` | Enable all annotation features (80-field CSQ). Implies … | `False` |
| `hgvs` | `bool` | Add HGVS notation. Implies … | `False` |
| `hgvsc` | `bool` | Enable HGVSc notation (implied by …). | `False` |
| `hgvsp` | `bool` | Enable HGVSp notation (implied by …). | `False` |
| `shift_hgvs` | `bool` or `None` | 3' shift HGVS notation. | `None` |
| `no_escape` | `bool` | Don't URI-escape HGVS strings. | `False` |
| `remove_hgvsp_version` | `bool` | Remove version from HGVSp transcript ID. | `False` |
| `hgvsp_use_prediction` | `bool` | Use predicted rather than observed protein sequence. | `False` |
| `reference_fasta` | `str` or `None` | Path to reference FASTA (required for HGVS/everything). | `None` |
| `check_existing` | `bool` | Check for co-located known variants (implied by AF flags). | `False` |
| `af` | `bool` | Include allele frequencies. | `False` |
| `af_1kg` | `bool` | Include 1000 Genomes allele frequencies. | `False` |
| `af_gnomade` | `bool` | Include gnomAD exome allele frequencies. | `False` |
| `af_gnomadg` | `bool` | Include gnomAD genome allele frequencies. | `False` |
| `max_af` | `bool` | Include maximum AF across populations. | `False` |
| `pubmed` | `bool` | Include PubMed IDs for co-located variants. | `False` |
| `extended_probes` | `bool` | Use interval-overlap fallback for shifted indels. | `True` |
| `distance` | `int` or `tuple[int, int]` or `None` | Upstream/downstream distance for transcript overlap. Single int = both directions; tuple = (upstream, downstream). | `None` |
| `merged` | `bool` | Use merged Ensembl+RefSeq cache. | `False` |
| `failed` | `int` | Maximum allowed … | `0` |
| `use_fjall` | `bool` | Use fjall (embedded KV store) backend instead of Parquet. | `False` |
| `cache_size_mb` | `int` | Annotation cache size in MB. | `1024` |
| `skip_csq` | `bool` | Exclude the raw CSQ column from the output. When True, only the parsed annotation columns are returned. | `True` |
| `output_vcf` | `str` or `None` | Path to write annotated VCF output. When set, annotation results are written directly to a VCF file and the output path is returned. When … | `None` |
| `show_progress` | `bool` | Show a progress bar on stderr during VCF output. Only used when `output_vcf` is set. | `True` |
| `compression` | `str` or `None` | VCF output compression. … | `None` |
| `on_batch_written` | callable or `None` | Callback invoked after each batch is written to VCF, with signature … | `None` |
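The two accepted shapes of `distance` can be made concrete with a small normalizer. This is a sketch only: `split_distance` is a hypothetical helper rather than a vepyr function, and the non-negativity check is an assumption, since the table does not state which values are valid.

```python
from typing import Optional, Tuple, Union

Distance = Union[int, Tuple[int, int], None]


def split_distance(distance: Distance) -> Optional[Tuple[int, int]]:
    """Expand a `distance` argument into an (upstream, downstream) pair.

    None is passed through so the library default stays in effect.
    """
    if distance is None:
        return None
    if isinstance(distance, bool):
        # bool is a subclass of int; reject it to catch accidental flags.
        raise TypeError("distance must be an int or a pair of ints")
    if isinstance(distance, int):
        upstream = downstream = distance  # single int applies to both sides
    else:
        upstream, downstream = distance   # tuple is (upstream, downstream)
    if upstream < 0 or downstream < 0:
        # Assumption: negative distances are treated as invalid here.
        raise ValueError("distances must be non-negative")
    return upstream, downstream
```

For example, `split_distance(5000)` and `split_distance((5000, 5000))` produce the same pair, which matches the "single int = both directions" rule in the table.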
Returns:
| Type | Description |
|---|---|
| `LazyFrame` or `str` | A polars LazyFrame, or the output path when `output_vcf` is set. |
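Because the return type depends on whether `output_vcf` was given, calling code may want to branch on it. A minimal sketch, assuming the result is either a `str` path or an object exposing `.collect()` as a polars LazyFrame does; `finish` is a hypothetical name, not part of vepyr.

```python
from typing import Any


def finish(result: Any) -> Any:
    """Return the output path unchanged, or collect a lazy result.

    A str is treated as the path of the written VCF; anything else is
    assumed to be a polars LazyFrame and is materialized with .collect().
    """
    if isinstance(result, str):
        return result
    return result.collect()


# With vepyr installed, usage might look like:
#   import vepyr
#   out = finish(vepyr.annotate("input.vcf", cache_dir, output_vcf="annotated.vcf"))
```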
Examples:
>>> import vepyr
>>> lf = vepyr.annotate("input.vcf", "/data/vep/parquet/115_GRCh38_vep")
>>> lf.collect()
>>> # Full annotation with all features
>>> lf = vepyr.annotate(
... "input.vcf",
... "/data/vep/parquet/115_GRCh38_vep",
... everything=True,
... reference_fasta="/ref/GRCh38.fa",
... )
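The parameter table notes that `reference_fasta` is required for HGVS and `everything`. One way to catch a missing FASTA before starting a run is to validate the options up front. The helper below is a hypothetical sketch, not vepyr's own validation, and it assumes that `hgvsc` and `hgvsp` count as HGVS flags for this purpose.

```python
from typing import Any, Dict, Optional


def hgvs_options(*, everything: bool = False, hgvs: bool = False,
                 hgvsc: bool = False, hgvsp: bool = False,
                 reference_fasta: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for annotate(), checking the FASTA rule.

    Assumption: any of everything/hgvs/hgvsc/hgvsp needs reference_fasta.
    """
    needs_fasta = everything or hgvs or hgvsc or hgvsp
    if needs_fasta and reference_fasta is None:
        raise ValueError(
            "reference_fasta is required when HGVS or everything is enabled"
        )
    return {
        "everything": everything,
        "hgvs": hgvs,
        "hgvsc": hgvsc,
        "hgvsp": hgvsp,
        "reference_fasta": reference_fasta,
    }


# With vepyr installed, the options can be splatted into the call:
#   import vepyr
#   opts = hgvs_options(everything=True, reference_fasta="/ref/GRCh38.fa")
#   lf = vepyr.annotate("input.vcf", "/data/vep/parquet/115_GRCh38_vep", **opts)
```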