API reference

`vepyr.build_cache(release, cache_dir, *, cache_type, species='homo_sapiens', assembly='GRCh38', partitions=8, cache_format='parquet', local_cache=None, download_retries=10, show_progress=True, on_progress=None, overwrite=False)`

Download an Ensembl VEP cache and convert it to an optimized cache.

Parameters:

Name	Type	Description	Default
`release`	`int`	Ensembl release number (e.g. 115).	required
`cache_dir`	`str`	Root directory for cache data and Parquet output.	required
`cache_type`	`str`	Required Ensembl VEP cache type: `"ensembl"`, `"merged"`, or `"refseq"`.	required
`species`	`str`	Species name (default: `"homo_sapiens"`).	`'homo_sapiens'`
`assembly`	`str`	Genome assembly (default: `"GRCh38"`).	`'GRCh38'`
`partitions`	`int`	Number of DataFusion partitions for parallelism (default: 8).	`8`
`cache_format`	`str`	Cache format to build. Only `"parquet"` is supported (default).	`'parquet'`
`local_cache`	`str or None`	Path to an already-unpacked Ensembl VEP cache directory (the one containing `info.txt`). When provided, downloading and extraction are skipped entirely.	`None`
`download_retries`	`int`	Maximum number of resume-retries for the cache download (default: 10). Each retry resumes from the last byte received.	`10`
`show_progress`	`bool`	Show tqdm progress bars during conversion (default: True). Note: the partitioned-Parquet build path does not currently emit per-batch progress events, so no bars appear during the cache build regardless of `show_progress` / `on_progress`.	`True`
`on_progress`	`callable or None`	Custom progress callback with signature `(entity, format, batch_rows, total_rows, total_expected)`. Overrides the default tqdm bars when provided. As noted for `show_progress`, the Parquet build path does not currently invoke it.	`None`
`overwrite`	`bool`	Rebuild existing cache outputs instead of skipping them.	`False`

Returns:

Type	Description
`list[tuple[str, int]]`	List of `(parquet_file_path, row_count)` for each written file.
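
The returned list of `(parquet_file_path, row_count)` pairs lends itself to a quick sanity check after a build. The following is a minimal sketch, not part of vepyr: `summarize_build` is a hypothetical helper, and the file paths are stand-in values in the documented shape rather than real output.

```python
from typing import Iterable


def summarize_build(written: Iterable[tuple[str, int]]) -> dict:
    """Summarize the (parquet_file_path, row_count) pairs returned by build_cache."""
    files = list(written)
    return {
        "files": len(files),
        "rows": sum(rows for _, rows in files),
        # Files with zero rows are worth a second look after a build.
        "empty": [path for path, rows in files if rows == 0],
    }


# Stand-in data (hypothetical paths). A real run would obtain it with e.g.
#   written = vepyr.build_cache(115, "/data/vep", cache_type="ensembl")
written = [("variation/chr1.parquet", 1200), ("variation/chr2.parquet", 0)]
summary = summarize_build(written)
print(summary)
```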

`vepyr.build_plugin_cache(plugin, version, *, source_path, cache_dir, plugin_cache_root, chroms=None, plugins_repo=None, overwrite=False)`

Build a per-chromosome plugin cache.

`plugin` and `version` select `plugins/<plugin>/<plugin>.source.toml` from the public vepyr-plugins repo at that git tag (or from `plugins_repo` when working offline). Tiering is inherited from the variation cache at `cache_dir`. Returns per-chromosome `(chrom, rows, warm, cold)` tuples.
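
To make the lookup rule concrete, here is a small sketch of how the plugin name maps to the source definition path inside the plugins repo. `source_toml_path` is a hypothetical helper written for illustration, and `example_plugin` is a made-up plugin name; how vepyr itself resolves the file is not shown in this reference.

```python
from pathlib import PurePosixPath


def source_toml_path(plugin: str) -> str:
    """Relative path of a plugin's source definition inside the plugins repo."""
    return str(PurePosixPath("plugins") / plugin / f"{plugin}.source.toml")


# "example_plugin" is a placeholder name, not a real plugin.
print(source_toml_path("example_plugin"))
```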

`vepyr.annotate(vcf, cache_dir, *, everything=False, hgvs=False, hgvsc=False, hgvsp=False, shift_hgvs=None, no_escape=False, remove_hgvsp_version=False, hgvsp_use_prediction=False, reference_fasta=None, check_existing=False, af=False, af_1kg=False, af_gnomade=False, af_gnomadg=False, max_af=False, pubmed=False, cache_format='parquet', extended_probes=True, distance=None, gencode_basic=False, gencode_primary=False, all_refseq=False, exclude_predicted=False, pick=False, pick_allele=False, per_gene=False, pick_allele_gene=False, flag_pick=False, flag_pick_allele=False, flag_pick_allele_gene=False, pick_order=None, buffer_size=5000, failed=0, cache_size_mb=1024, workers=1, skip_csq=True, plugin_cache_root=None, output_vcf=None, show_progress=True, compression=None, on_batch_written=None)`

Annotate variants from a VCF file with VEP consequences.

Reads the VCF, runs `annotate_vep()` against the partitioned Parquet cache produced by `build_cache`, and returns a polars `LazyFrame`.

The engine auto-discovers context tables (transcript, exon, translation, regulatory, motif) from `cache_dir` subdirectories.

Parameters:

Name	Type	Description	Default
`vcf`	`str`	Path to the input VCF file.	required
`cache_dir`	`str`	Path to the Parquet cache directory produced by `build_cache`, e.g. `"/data/vep/wgs/parquet/115_GRCh38_ensembl"`.	required
`everything`	`bool`	Enable all annotation features (80-field CSQ). Implies `hgvs`, `af`, `check_existing`, `pubmed`, etc. Requires `reference_fasta`.	`False`
`hgvs`	`bool`	Add HGVS notation. Implies `hgvsc`, `hgvsp`, `shift_hgvs`. Requires `reference_fasta`.	`False`
`hgvsc`	`bool`	Enable HGVSc notation (implied by `hgvs`/`everything`).	`False`
`hgvsp`	`bool`	Enable HGVSp notation (implied by `hgvs`/`everything`).	`False`
`shift_hgvs`	`bool or None`	3' shift HGVS notation. `None` = auto (True when hgvs enabled).	`None`
`no_escape`	`bool`	Don't URI-escape HGVS strings.	`False`
`remove_hgvsp_version`	`bool`	Remove version from HGVSp transcript ID.	`False`
`hgvsp_use_prediction`	`bool`	Use predicted rather than observed protein sequence.	`False`
`reference_fasta`	`str or None`	Path to reference FASTA (required for HGVS/everything).	`None`
`check_existing`	`bool`	Check for co-located known variants (implied by AF flags).	`False`
`af`	`bool`	Include allele frequencies.	`False`
`af_1kg`	`bool`	Include 1000 Genomes allele frequencies.	`False`
`af_gnomade`	`bool`	Include gnomAD exome allele frequencies.	`False`
`af_gnomadg`	`bool`	Include gnomAD genome allele frequencies.	`False`
`max_af`	`bool`	Include maximum AF across populations.	`False`
`pubmed`	`bool`	Include PubMed IDs for co-located variants.	`False`
`extended_probes`	`bool`	Use interval-overlap fallback for shifted indels (default: True).	`True`
`distance`	`int or tuple[int, int] or None`	Upstream/downstream distance for transcript overlap. Single int = both directions; tuple = (upstream, downstream).	`None`
`gencode_basic`	`bool`	Restrict to transcripts in the GENCODE basic set. Mutually exclusive with `gencode_primary`.	`False`
`gencode_primary`	`bool`	Restrict to transcripts in the GENCODE primary set (GRCh38 only). Mutually exclusive with `gencode_basic`.	`False`
`all_refseq`	`bool`	Keep all RefSeq transcripts including CCDS/EST-style rows.	`False`
`exclude_predicted`	`bool`	Exclude predicted RefSeq transcripts (`XM_` / `XR_` prefixes).	`False`
`pick`	`bool`	Emit one selected consequence per variant, matching VEP `--pick`.	`False`
`pick_allele`	`bool`	Emit one selected consequence per allele, matching VEP `--pick_allele`.	`False`
`per_gene`	`bool`	Emit one selected consequence per gene while retaining non-transcript rows, matching VEP `--per_gene`.	`False`
`pick_allele_gene`	`bool`	Emit one selected consequence per allele and gene, matching VEP `--pick_allele_gene`.	`False`
`flag_pick`	`bool`	Retain all consequences and add `PICK=1` to one selected entry per variant, matching VEP `--flag_pick`.	`False`
`flag_pick_allele`	`bool`	Retain all consequences and add `PICK=1` to one selected entry per allele, matching VEP `--flag_pick_allele`.	`False`
`flag_pick_allele_gene`	`bool`	Add a standalone `PICK=1` CSQ field for the selected transcript per allele and gene, matching VEP `--flag_pick_allele_gene`.	`False`
`pick_order`	`str or None`	Comma-separated VEP pick ranking order, e.g. `"biotype,rank,mane_select,tsl,canonical,appris,ccds,length"`.	`None`
`buffer_size`	`int`	Number of input variants per VEP-style annotation buffer. Defaults to Ensembl VEP's `--buffer_size` default of `5000`.	`5000`
`failed`	`int`	Maximum allowed `failed` flag value from cache (default: 0).	`0`
`cache_format`	`str`	Cache format to use. Only `"parquet"` is supported (default).	`'parquet'`
`cache_size_mb`	`int`	Annotation cache size in MB (default: 1024).	`1024`
`workers`	`int`	Number of within-contig fused annotation pipelines (default: 1). The single annotation-concurrency knob. `1` is serial; values greater than 1 require a tabix-indexed (bgzip + `.tbi`) input VCF.	`1`
`skip_csq`	`bool`	Exclude the raw CSQ column from the output (default: True). When True, only the parsed annotation columns are returned.	`True`
`output_vcf`	`str or None`	Path to write annotated VCF output. When set, annotation results are written directly to a VCF file and the output path is returned. When `None` (default), returns a polars `LazyFrame`. Compression is auto-detected from the file extension: `.vcf` for plain text, `.vcf.gz` or `.vcf.bgz` for block-gzipped (bgzf). Override with the `compression` parameter.	`None`
`show_progress`	`bool`	Show a progress bar on stderr during VCF output (default: True). Only used when `output_vcf` is set.	`True`
`compression`	`str or None`	VCF output compression. `"bgzf"` (block-gzip, tabix-compatible), `"gzip"`, `"plain"`, or `None` (auto-detect from extension). Only used when `output_vcf` is set.	`None`
`on_batch_written`	`callable or None`	Callback invoked after each batch is written to VCF, with signature `(batch_rows: int, total_rows: int, total_input: int)`. `total_rows` is the cumulative number of VCF records written so far. `total_input` is the total number of input variants when known. Useful for driving tqdm progress bars in notebooks. Only used when `output_vcf` is set.	`None`
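
Several parameters in the table above carry constraints: `hgvs` and `everything` require `reference_fasta`, `gencode_basic` and `gencode_primary` are mutually exclusive, and `workers > 1` needs a tabix-indexed input. A pre-flight check along these lines can surface mistakes before a long run. This is only a sketch: `check_annotate_args` is a hypothetical helper, and it assumes the tabix index sits next to the VCF as `<vcf>.tbi`, which is the usual tabix convention but is not stated in this reference.

```python
import os


def check_annotate_args(vcf, *, everything=False, hgvs=False, reference_fasta=None,
                        gencode_basic=False, gencode_primary=False, workers=1):
    """Return a list of problems with the documented argument constraints."""
    problems = []
    if (everything or hgvs) and reference_fasta is None:
        problems.append("everything/hgvs require reference_fasta")
    if gencode_basic and gencode_primary:
        problems.append("gencode_basic and gencode_primary are mutually exclusive")
    # Assumption: the index lives at "<vcf>.tbi" alongside the bgzipped VCF.
    if workers > 1 and not os.path.exists(vcf + ".tbi"):
        problems.append("workers > 1 requires a tabix-indexed (bgzip + .tbi) input VCF")
    return problems


# An empty list means the checked constraints are satisfied.
print(check_annotate_args("input.vcf", hgvs=True,
                          gencode_basic=True, gencode_primary=True))
```

Returning a list instead of raising on the first problem lets a caller report every issue at once.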

Returns:

Type	Description
`LazyFrame or str`	When `output_vcf` is `None`: annotated variants as a polars `LazyFrame` with typed annotation columns plus original VCF fields. When `output_vcf` is set: the output VCF file path.
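
When `output_vcf` is set, the compression is chosen from the file extension unless `compression` overrides it. The sketch below restates that documented rule as code. `infer_compression` is a hypothetical helper, not vepyr's implementation, and raising `ValueError` for an unrecognized extension is an assumption, since this reference does not say what happens in that case.

```python
def infer_compression(output_vcf, compression=None):
    """Mirror the documented extension rule for VCF output compression."""
    if compression is not None:
        return compression  # an explicit value overrides auto-detection
    if output_vcf.endswith((".vcf.gz", ".vcf.bgz")):
        return "bgzf"  # block-gzipped, tabix-compatible
    if output_vcf.endswith(".vcf"):
        return "plain"
    # Assumption: behaviour for other extensions is not documented.
    raise ValueError(f"cannot infer compression from {output_vcf!r}")


for name in ("annotated.vcf", "annotated.vcf.gz", "annotated.vcf.bgz"):
    print(name, "->", infer_compression(name))
```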

Examples:

>>> import vepyr
>>> lf = vepyr.annotate("input.vcf", "/data/vep/parquet/115_GRCh38_ensembl")
>>> lf.collect()

>>> # Full annotation with all features
>>> lf = vepyr.annotate(
...     "input.vcf",
...     "/data/vep/parquet/115_GRCh38_ensembl",
...     everything=True,
...     reference_fasta="/ref/GRCh38.fa",
... )

>>> # Selective: HGVS + allele frequencies
>>> lf = vepyr.annotate(
...     "input.vcf",
...     "/data/vep/parquet/115_GRCh38_ensembl",
...     hgvs=True,
...     af=True,
...     af_gnomadg=True,
...     reference_fasta="/ref/GRCh38.fa",
... )

>>> # Write annotated VCF directly
>>> path = vepyr.annotate(
...     "input.vcf",
...     "/data/vep/parquet/115_GRCh38_ensembl",
...     everything=True,
...     reference_fasta="/ref/GRCh38.fa",
...     output_vcf="annotated.vcf",
... )
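
A callback for `on_batch_written` receives `(batch_rows, total_rows, total_input)` after each batch is written. The following sketch shows one way to collect those events without tqdm. `BatchProgress` is a hypothetical class, and the two calls at the end simulate events with made-up numbers; in a real run the object would be passed as `on_batch_written=progress` together with `output_vcf`.

```python
class BatchProgress:
    """Collects on_batch_written events: (batch_rows, total_rows, total_input)."""

    def __init__(self):
        self.batches = 0
        self.total_rows = 0
        self.total_input = None

    def __call__(self, batch_rows, total_rows, total_input):
        self.batches += 1
        self.total_rows = total_rows    # cumulative count of VCF records written
        self.total_input = total_input  # total input variants, when known
        print(f"batch {self.batches}: +{batch_rows} rows, {total_rows} written")


progress = BatchProgress()
# Simulated events with illustrative numbers only.
progress(5000, 5000, 12000)
progress(5000, 10000, 12000)
```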
