👩🍳 Cookbook
Basic gnomAD VCF file reading¶
In [1]:
Copied!
import polars_bio as pb
import polars as pl
import polars_bio as pb
import polars as pl
INFO:polars_bio:Creating BioSessionContext
In [2]:
Copied!
gcs_vcf_path = (
"gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz"
)
gcs_vcf_path = (
"gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz"
)
In [3]:
Copied!
pb.read_vcf(gcs_vcf_path).limit(3).collect()
pb.read_vcf(gcs_vcf_path).limit(3).collect()
INFO:polars_bio:Table: gnomad_v4_1_sv_sites_gz registered for path: gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz
Out[3]:
shape: (3, 8)
chrom | start | end | id | ref | alt | qual | filter |
---|---|---|---|---|---|---|---|
str | u32 | u32 | str | str | str | f64 | str |
"chr1" | 10000 | 295666 | "gnomAD-SV_v3_DUP_chr1_01c2781c" | "N" | "<DUP>" | 134.0 | "HIGH_NCR" |
"chr1" | 10434 | 10434 | "gnomAD-SV_v3_BND_chr1_1a45f73a" | "N" | "<BND>" | 260.0 | "HIGH_NCR;UNRESOLVED" |
"chr1" | 10440 | 10440 | "gnomAD-SV_v3_BND_chr1_3fa36917" | "N" | "<BND>" | 198.0 | "HIGH_NCR;UNRESOLVED" |
AWS S3¶
In [4]:
Copied!
aws_s3_vcf_path = "s3://gnomad-public-us-east-1/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz"
aws_s3_vcf_path = "s3://gnomad-public-us-east-1/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz"
In [5]:
Copied!
pb.read_vcf(aws_s3_vcf_path).limit(3).collect()
pb.read_vcf(aws_s3_vcf_path).limit(3).collect()
INFO:polars_bio:Table: gnomad_exomes_v4_1_sites_chr21_bgz registered for path: s3://gnomad-public-us-east-1/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz
Out[5]:
shape: (3, 8)
chrom | start | end | id | ref | alt | qual | filter |
---|---|---|---|---|---|---|---|
str | u32 | u32 | str | str | str | f64 | str |
"chr21" | 5031905 | 5031905 | "" | "C" | "A" | 0.0 | "AC0;AS_VQSR" |
"chr21" | 5031905 | 5031905 | "" | "C" | "T" | 0.0 | "AC0;AS_VQSR" |
"chr21" | 5031909 | 5031909 | "" | "T" | "C" | 0.0 | "AC0;AS_VQSR" |
2. How to specify additional VCF INFO fields to be parsed¶
In [6]:
Copied!
vcf_info_fields = ["SVTYPE", "SVLEN"]
pb.read_vcf(gcs_vcf_path, info_fields=vcf_info_fields).limit(3).collect()
vcf_info_fields = ["SVTYPE", "SVLEN"]
pb.read_vcf(gcs_vcf_path, info_fields=vcf_info_fields).limit(3).collect()
INFO:polars_bio:Table: gnomad_v4_1_sv_sites_gz registered for path: gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz
Out[6]:
shape: (3, 10)
chrom | start | end | id | ref | alt | qual | filter | svtype | svlen |
---|---|---|---|---|---|---|---|---|---|
str | u32 | u32 | str | str | str | f64 | str | str | i32 |
"chr1" | 10000 | 295666 | "gnomAD-SV_v3_DUP_chr1_01c2781c" | "N" | "<DUP>" | 134.0 | "HIGH_NCR" | "DUP" | 285666 |
"chr1" | 10434 | 10434 | "gnomAD-SV_v3_BND_chr1_1a45f73a" | "N" | "<BND>" | 260.0 | "HIGH_NCR;UNRESOLVED" | "BND" | -1 |
"chr1" | 10440 | 10440 | "gnomAD-SV_v3_BND_chr1_3fa36917" | "N" | "<BND>" | 198.0 | "HIGH_NCR;UNRESOLVED" | "BND" | -1 |
3. How to spead up reading local VCF files with multiple threads¶
In [ ]:
Copied!
! gsutil -m cp $gcs_vcf_path /tmp/gnomad.v4.1.sv.sites.vcf.gz &> /dev/null
! gsutil -m cp $gcs_vcf_path /tmp/gnomad.v4.1.sv.sites.vcf.gz &> /dev/null
In [7]:
Copied!
%%time
pb.read_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", thread_num=1).count().collect()
%%time
pb.read_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", thread_num=1).count().collect()
INFO:polars_bio:Table: gnomad_v4_1_sv_sites_gz registered for path: /tmp/gnomad.v4.1.sv.sites.vcf.gz
0rows [00:00, ?rows/s]
CPU times: user 12.9 s, sys: 2.08 s, total: 15 s Wall time: 11.2 s
Out[7]:
shape: (1, 8)
chrom | start | end | id | ref | alt | qual | filter |
---|---|---|---|---|---|---|---|
u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
2154486 | 2154486 | 2154486 | 2154486 | 2154486 | 2154486 | 2154486 | 2154486 |
In [8]:
Copied!
%%time
pb.read_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", thread_num=4).count().collect()
%%time
pb.read_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", thread_num=4).count().collect()
INFO:polars_bio:Table: gnomad_v4_1_sv_sites_gz registered for path: /tmp/gnomad.v4.1.sv.sites.vcf.gz
0rows [00:00, ?rows/s]
CPU times: user 13 s, sys: 1.78 s, total: 14.8 s Wall time: 3.45 s
Out[8]:
shape: (1, 8)
chrom | start | end | id | ref | alt | qual | filter |
---|---|---|---|---|---|---|---|
u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
2154486 | 2154486 | 2154486 | 2154486 | 2154486 | 2154486 | 2154486 | 2154486 |
4. How to perform an overlap operation on two remote VCF files in streaming mode¶
In [2]:
Copied!
vcf_1 = "gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz"
vcf_2 = "gs://gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz"
vcf_1 = "gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz"
vcf_2 = "gs://gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz"
In [3]:
Copied!
vcf_read_options_1 = pb.VcfReadOptions(info_fields=["SVTYPE", "SVLEN"], thread_num=1)
read_options_1 = pb.ReadOptions(vcf_read_options=vcf_read_options_1)
vcf_read_options_1 = pb.VcfReadOptions(info_fields=["SVTYPE", "SVLEN"], thread_num=1)
read_options_1 = pb.ReadOptions(vcf_read_options=vcf_read_options_1)
In [9]:
Copied!
pb.overlap(vcf_1, vcf_2, streaming=True, read_options1=read_options_1).sink_csv(
"/tmp/streaming_run.csv"
)
pb.overlap(vcf_1, vcf_2, streaming=True, read_options1=read_options_1).sink_csv(
"/tmp/streaming_run.csv"
)
{'.v4', '.sites', '.sv', '.1', '.vcf', '.gz'}
INFO:polars_bio.operation:Running in streaming mode... INFO:polars_bio.operation:Running Overlap operation with algorithm Coitrees and 1 thread(s)...
In [12]:
Copied!
pl.read_csv("/tmp/streaming_run.csv").limit(3)
pl.read_csv("/tmp/streaming_run.csv").limit(3)
Out[12]:
shape: (3, 18)
chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | id_1 | ref_1 | alt_1 | qual_1 | filter_1 | svtype_1 | svlen_1 | id_2 | ref_2 | alt_2 | qual_2 | filter_2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | i64 | i64 | str | i64 | i64 | str | str | str | f64 | str | str | i64 | str | str | str | f64 | str |
"chr21" | 5019150 | 5047500 | "chr21" | 5031905 | 5031905 | "gnomAD-SV_v3_DUP_chr21_029eb66… | "N" | "<DUP>" | 34.0 | "PASS" | "DUP" | 28350 | "" | "C" | "A" | 0.0 | "AC0;AS_VQSR" |
"chr21" | 5019150 | 5047500 | "chr21" | 5031905 | 5031905 | "gnomAD-SV_v3_DUP_chr21_029eb66… | "N" | "<DUP>" | 34.0 | "PASS" | "DUP" | 28350 | "" | "C" | "T" | 0.0 | "AC0;AS_VQSR" |
"chr21" | 5019150 | 5047500 | "chr21" | 5031909 | 5031909 | "gnomAD-SV_v3_DUP_chr21_029eb66… | "N" | "<DUP>" | 34.0 | "PASS" | "DUP" | 28350 | "" | "T" | "C" | 0.0 | "AC0;AS_VQSR" |
In [10]:
Copied!
pb.overlap(vcf_1, vcf_2, streaming=False, read_options1=read_options_1).collect().count()
pb.overlap(vcf_1, vcf_2, streaming=False, read_options1=read_options_1).collect().count()
{'.v4', '.sites', '.sv', '.1', '.vcf', '.gz'}
INFO:polars_bio:Table: gnomad_v4_1_sv_sites_gz registered for path: gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz INFO:polars_bio:Table: gnomad_exomes_v4_1_sites_chr21_bgz registered for path: gs://gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz INFO:polars_bio.operation:Running Overlap operation with algorithm Coitrees and 1 thread(s)...
0rows [00:00, ?rows/s]
Out[10]:
shape: (1, 18)
chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | id_1 | ref_1 | alt_1 | qual_1 | filter_1 | svtype_1 | svlen_1 | id_2 | ref_2 | alt_2 | qual_2 | filter_2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 |
5. How to read a VCF from Google Life Sciences¶
In [16]:
Copied!
gcs_vcf_path = "gs://genomics-public-data/platinum-genomes/vcf/NA12878_S1.genome.vcf"
gcs_vcf_path = "gs://genomics-public-data/platinum-genomes/vcf/NA12878_S1.genome.vcf"
In [17]:
Copied!
info_fields=["AC", "AF"]
info_fields=["AC", "AF"]
In [18]:
Copied!
pb.read_vcf(gcs_vcf_path, info_fields=info_fields).limit(3).collect()
pb.read_vcf(gcs_vcf_path, info_fields=info_fields).limit(3).collect()
INFO:polars_bio:Table: na12878_s1_genome registered for path: gs://genomics-public-data/platinum-genomes/vcf/NA12878_S1.genome.vcf
Out[18]:
shape: (3, 10)
chrom | start | end | id | ref | alt | qual | filter | ac | af |
---|---|---|---|---|---|---|---|---|---|
str | u32 | u32 | str | str | str | f64 | str | list[i32] | list[f32] |
"chrM" | 1 | 1 | "" | "G" | "" | 0.0 | "PASS" | null | null |
"chrM" | 2 | 72 | "" | "A" | "" | 0.0 | "PASS" | null | null |
"chrM" | 73 | 73 | "" | "G" | "A" | 8752.780273 | "TruthSensitivityTranche99.90to… | [2] | [1.0] |
SQL data processing¶
Check SQL reference for details.
In [4]:
Copied!
pb.register_vcf(vcf_1, "gnomad_sv", thread_num=1, info_fields=["SVTYPE", "SVLEN"])
pb.register_vcf(vcf_1, "gnomad_sv", thread_num=1, info_fields=["SVTYPE", "SVLEN"])
INFO:polars_bio:Table: gnomad_sv registered for path: gs://gcp-public-data--gnomad/release/4.1/genome_sv/gnomad.v4.1.sv.sites.vcf.gz
In [5]:
Copied!
pb.sql("SELECT chrom, svtype FROM gnomad_sv").limit(3).collect()
pb.sql("SELECT chrom, svtype FROM gnomad_sv").limit(3).collect()
Out[5]:
shape: (3, 2)
chrom | svtype |
---|---|
str | str |
"chr1" | "DUP" |
"chr1" | "BND" |
"chr1" | "BND" |
In [12]:
Copied!
pb.sql("SELECT * FROM gnomad_sv WHERE SVTYPE = 'DEL' AND SVLEN > 1000").limit(3).collect()
pb.sql("SELECT * FROM gnomad_sv WHERE SVTYPE = 'DEL' AND SVLEN > 1000").limit(3).collect()
Out[12]:
shape: (3, 10)
chrom | start | end | id | ref | alt | qual | filter | svtype | svlen |
---|---|---|---|---|---|---|---|---|---|
str | u32 | u32 | str | str | str | f64 | str | str | i32 |
"chr1" | 22000 | 30000 | "gnomAD-SV_v3_DEL_chr1_fa103016" | "N" | "<DEL>" | 999.0 | "HIGH_NCR" | "DEL" | 8000 |
"chr1" | 40000 | 47000 | "gnomAD-SV_v3_DEL_chr1_b26f63f7" | "N" | "<DEL>" | 145.0 | "PASS" | "DEL" | 7000 |
"chr1" | 79086 | 88118 | "gnomAD-SV_v3_DEL_chr1_733c4ef0" | "N" | "<DEL:ME:LINE1>" | 344.0 | "UNRESOLVED" | "DEL" | 9032 |
In [6]:
Copied!
pb.sql("SELECT alt, count(*) as cnt FROM gnomad_sv group by alt").collect_schema()
pb.sql("SELECT alt, count(*) as cnt FROM gnomad_sv group by alt").collect_schema()
Out[6]:
Schema([('alt', String), ('cnt', Int64)])
In [7]:
Copied!
pb.sql("SELECT alt, count(*) as cnt FROM gnomad_sv group by alt").collect()
pb.sql("SELECT alt, count(*) as cnt FROM gnomad_sv group by alt").collect()
0rows [00:00, ?rows/s]
Out[7]:
shape: (13, 2)
alt | cnt |
---|---|
str | i64 |
"<DUP>" | 269326 |
"<BND>" | 356035 |
"<CNV>" | 721 |
"<DEL>" | 1197080 |
"<INS>" | 83441 |
… | … |
"<INS:ME:SVA>" | 17607 |
"<CPX>" | 15189 |
"<INV>" | 2193 |
"<DEL:ME:HERVK>" | 693 |
"<CTX>" | 99 |
In [9]:
Copied!
pb.sql("SELECT chrom, count(*) as cnt FROM gnomad_sv GROUP BY chrom ORDER BY chrom").collect()
pb.sql("SELECT chrom, count(*) as cnt FROM gnomad_sv GROUP BY chrom ORDER BY chrom").collect()
0rows [00:00, ?rows/s]
Out[9]:
shape: (24, 2)
chrom | cnt |
---|---|
str | i64 |
"chr1" | 182804 |
"chr10" | 96755 |
"chr11" | 95690 |
"chr12" | 97655 |
"chr13" | 63839 |
… | … |
"chr7" | 131866 |
"chr8" | 101224 |
"chr9" | 87748 |
"chrX" | 78076 |
"chrY" | 12488 |
In [8]:
Copied!
pb.sql("SELECT * FROM gnomad_sv WHERE chrom='chr1'", streaming=True).sink_csv("/tmp/gnomad_chr1.csv")
pb.sql("SELECT * FROM gnomad_sv WHERE chrom='chr1'", streaming=True).sink_csv("/tmp/gnomad_chr1.csv")
In [10]:
Copied!
pl.read_csv("/tmp/gnomad_chr1.csv").count()
pl.read_csv("/tmp/gnomad_chr1.csv").count()
Out[10]:
shape: (1, 10)
chrom | start | end | id | ref | alt | qual | filter | svtype | svlen |
---|---|---|---|---|---|---|---|---|---|
u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
182804 | 182804 | 182804 | 182804 | 182804 | 182804 | 182804 | 182804 | 182804 | 182804 |
In [5]:
Copied!
pb.register_vcf(vcf_2, "gnomad_exomes", info_fields=["AC", "AF"])
pb.register_vcf(vcf_2, "gnomad_exomes", info_fields=["AC", "AF"])
INFO:polars_bio:Table: gnomad_exomes registered for path: gs://gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz
In [9]:
Copied!
pb.sql("SELECT replace(chrom,'chr','') AS chrom, start, ac,af FROM gnomad_exomes WHERE array_element(af,1)>0.01").limit(10).collect()
pb.sql("SELECT replace(chrom,'chr','') AS chrom, start, ac,af FROM gnomad_exomes WHERE array_element(af,1)>0.01").limit(10).collect()
Out[9]:
shape: (10, 4)
chrom | start | ac | af |
---|---|---|---|
str | u32 | list[i32] | list[f32] |
"21" | 5033364 | [372992] | [0.337086] |
"21" | 5033539 | [1107064] | [0.996312] |
"21" | 5034629 | [15145] | [0.020082] |
"21" | 5035021 | [1811] | [0.307888] |
"21" | 5035108 | [2] | [0.010989] |
"21" | 5035658 | [255555] | [0.336447] |
"21" | 5035846 | [37233] | [0.286906] |
"21" | 5035921 | [1682] | [0.010757] |
"21" | 5116593 | [4032] | [0.018626] |
"21" | 5116760 | [101] | [0.014914] |
In [7]:
Copied!
pb.overlap("gnomad_sv", "gnomad_exomes").collect().count()
pb.overlap("gnomad_sv", "gnomad_exomes").collect().count()
INFO:polars_bio.operation:Running Overlap operation with algorithm Coitrees and 1 thread(s)...
0rows [00:00, ?rows/s]
Out[7]:
shape: (1, 20)
chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | id_1 | ref_1 | alt_1 | qual_1 | filter_1 | svtype_1 | svlen_1 | id_2 | ref_2 | alt_2 | qual_2 | filter_2 | ac_2 | af_2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17875080 | 17580234 |
In [6]:
Copied!
pb.register_vcf(vcf_2, "gnomad_exomes")
pb.register_vcf(vcf_2, "gnomad_exomes")
INFO:polars_bio:Table: gnomad_exomes registered for path: gs://gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz
In [7]:
Copied!
pb.overlap("gnomad_sv", "gnomad_exomes", streaming=True).sink_csv("/tmp/overlap.csv")
pb.overlap("gnomad_sv", "gnomad_exomes", streaming=True).sink_csv("/tmp/overlap.csv")
INFO:polars_bio.operation:Running in streaming mode... INFO:polars_bio.operation:Running Overlap operation with algorithm Coitrees and 1 thread(s)...