# Pileup and coverage

SeQuiLa's pileup and coverage benchmarks.
## Testing environment

### Datasets

| Type | File | Size [GB] | Download |
|---|---|---|---|
| WES | NA12878.proper.wes.md.bam | 17 | BAM BAI SBI |
| WGS | NA12878.proper.wgs.md.bam | 278 | BAM BAI SBI |
| Reference | Homo_sapiens_assembly18.fasta | 2.9 | FASTA FAI |
### Tools

| Software | Version | Pileup | Coverage | Multi-threaded | Distributed |
|---|---|---|---|---|---|
| ADAM | 0.36.0 | no | yes | yes | yes |
| GATK | 4.2.3.0 | yes | yes (intervals only) | no | no |
| GATK-Spark | 4.2.3.0 | yes | yes (intervals only) | yes | yes |
| megadepth | 1.1.1 | no | yes | yes (I/O only) | no |
| mosdepth | 0.3.2 | no | yes | yes (I/O only) | no |
| sambamba | 0.8.1 | no | yes | no | no |
| samtools | 1.9 | yes | yes | no | no |
| samtools | 1.14 | yes | yes | yes (coverage and I/O only) | no |
| SeQuiLa-cov | 0.6.11 | no | yes | yes | yes |
| SeQuiLa | 1.0.0 | yes | yes | yes | yes |
### Single node specification

| Processor | Base freq [GHz] | CPUs | Total cores (threads) | Memory [GB] | OS version | Disk |
|---|---|---|---|---|---|---|
| Intel(R) Xeon(R) E5-2618L v4 | 2.20 | 2 | 20 (40) | 256 | RHEL 7.8 (Maipo) | 3 TB (RAID1) |
### Hadoop cluster

| Masters | Workers | Hadoop distribution | Total HDFS disk [TB] | Total YARN cores | Total YARN RAM [TB] | Network [Gbit/s] |
|---|---|---|---|---|---|---|
| 6 | 34 | HDP 3.1.4 | 700 | 1360 (680) | 6.8 | 100 |
## SeQuiLa, ADAM and GATK(Spark) parameters

| Parameter | Value | SeQuiLa only | Local test | Cluster test |
|---|---|---|---|---|
| spark.biodatageeks.pileup.useVectorizedOrcWriter | true | yes | yes | no |
| spark.sql.orc.compression.codec | snappy | no | yes | yes |
| spark.biodatageeks.readAligment.method | hadoopBAM | yes | yes | no |
| spark.biodatageeks.readAligment.method | disq | yes | no | yes |
| spark.serializer | org.apache.spark.serializer.KryoSerializer | yes¹ | yes | yes |
| spark.kryo.registrator | org.biodatageeks.sequila.pileup.serializers.CustomKryoRegistrator | yes | yes | yes |
| spark.kryoserializer.buffer.max | 1024m | yes | yes | yes |
| spark.hadoop.mapred.min.split.size | 268435456 | yes | no | yes |
| spark.hadoop.mapred.min.split.size | 134217728 | yes | yes | no |
| spark.driver.memory | 8g | no | no | yes |
| spark.driver.memory | 16g | no | yes | no |
| spark.executor.memory | 4g² | no | yes | yes |
| spark.dynamicAllocation.enabled | false | no | yes | yes |
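As an illustrative sketch of how the local-test settings above can be applied, the corresponding `spark-shell` invocation might look as follows. The SeQuiLa package coordinates and version shown here are assumptions, not taken from this page:

```shell
# Hypothetical spark-shell launch applying the "Local test" column above.
# The --packages coordinates are an assumption; check the SeQuiLa docs for
# the artifact matching your Spark/Scala version.
spark-shell \
  --packages org.biodatageeks:sequila_2.12:1.0.0 \
  --driver-memory 16g \
  --conf spark.biodatageeks.pileup.useVectorizedOrcWriter=true \
  --conf spark.sql.orc.compression.codec=snappy \
  --conf spark.biodatageeks.readAligment.method=hadoopBAM \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.biodatageeks.sequila.pileup.serializers.CustomKryoRegistrator \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.hadoop.mapred.min.split.size=134217728 \
  --conf spark.executor.memory=4g \
  --conf spark.dynamicAllocation.enabled=false
```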
## Other parameters

| Software | Operation | Command line options | Multithreading options |
|---|---|---|---|
| ADAM | coverage | -- coverage -collapse | --spark-master local[$n] |
| GATK | pileup | PileupSpark | --spark-master local[$n] |
| GATK | pileup | Pileup | – |
| megadepth | coverage | --coverage --require-mdz --keep-order | --threads $(n-1) |
| mosdepth | coverage | -x | --threads $(n-1) |
| sambamba | coverage | depth base | --nthreads=$n³ |
| samtools | coverage | depth | --threads $(n-1)⁴ |
| samtools | pileup | mpileup -B -x -A -q 0 -Q 0 | – |
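For the single-node runs, the options in the table translate into command lines roughly like the sketch below. Input/output paths, the `$n` thread count, and output file names are placeholders, not the exact commands used in the benchmark:

```shell
# Illustrative single-node invocations built from the options above.
# $n (thread count), file paths, and output names are placeholders.
n=16
bam=NA12878.proper.wgs.md.bam
ref=Homo_sapiens_assembly18.fasta

# samtools coverage (depth), with n-1 extra decompression threads
samtools depth --threads $((n-1)) "$bam" > samtools.depth.txt

# samtools pileup, single-threaded
samtools mpileup -B -x -A -q 0 -Q 0 -f "$ref" "$bam" > samtools.mpileup.txt

# mosdepth coverage; -x skips mate-overlap detection for speed
mosdepth -x --threads $((n-1)) mosdepth_out "$bam"

# sambamba per-base coverage
sambamba depth base --nthreads=$n "$bam" > sambamba.depth.txt
```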
## SeQuiLa

### Coverage

```scala
import org.apache.spark.sql.{SequilaSession, SparkSession}

val ss = SequilaSession(spark)
ss.sparkContext.setLogLevel("INFO")
val bamPath = "/scratch/wiewiom/WGS/NA12878.proper.wgs.md.bam"
val referencePath = "/home/wiewiom/data/Homo_sapiens_assembly18.fasta"

ss.time {
  ss
    .coverage(bamPath, referencePath)
    .write
    .orc("/tmp/sequila.coverage")
}
```
### Pileup

```scala
import org.apache.spark.sql.{SequilaSession, SparkSession}

val ss = SequilaSession(spark)
ss.sparkContext.setLogLevel("INFO")
val bamPath = "/scratch/wiewiom/WGS/NA12878.proper.wgs.md.bam"
val referencePath = "/home/wiewiom/data/Homo_sapiens_assembly18.fasta"

ss.time {
  ss
    .pileup(bamPath, referencePath, true)
    .write
    .orc("/tmp/sequila.pileup")
}
```
## Results single node

### Coverage

### Pileup

## Results Hadoop cluster

### Coverage

### Pileup
Last modified July 26, 2024: Fix comet condition (#180) (15431c5)