# Pileup and coverage

SeQuiLa's pileup and coverage benchmarks.

## Testing environment

### Datasets
| Type | File | Size [GB] | Download |
|---|---|---|---|
| WES | NA12878.proper.wes.md.bam | 17 | BAM BAI SBI |
| WGS | NA12878.proper.wgs.md.bam | 278 | BAM BAI SBI |
| Reference | Homo_sapiens_assembly18.fasta | 2.9 | FASTA FAI |
### Tools

| Software | Version | Pileup | Coverage | Multi-threaded | Distributed |
|---|---|---|---|---|---|
| ADAM | 0.36.0 | no | yes | yes | yes |
| GATK | 4.2.3.0 | yes | yes (intervals only) | no | no |
| GATK-Spark | 4.2.3.0 | yes | yes (intervals only) | yes | yes |
| megadepth | 1.1.1 | no | yes | yes (I/O only) | no |
| mosdepth | 0.3.2 | no | yes | yes (I/O only) | no |
| sambamba | 0.8.1 | no | yes | no | no |
| samtools | 1.9 | yes | yes | no | no |
| samtools | 1.14 | yes | yes | yes (coverage and I/O only) | no |
| SeQuiLa-cov | 0.6.11 | no | yes | yes | yes |
| SeQuiLa | 1.0.0 | yes | yes | yes | yes |
### Single node specification

| Processor | Base freq [GHz] | CPUs | Cores (threads) | Memory [GB] | OS version | Disk |
|---|---|---|---|---|---|---|
| Intel(R) Xeon(R) E5-2618L v4 | 2.20 | 2 | 20 (40) | 256 | RHEL 7.8 (Maipo) | 3 TB (RAID1) |
### Hadoop cluster

| Masters | Workers | Hadoop distribution | Total HDFS disk [TB] | Total YARN cores | Total YARN RAM [TB] | Network [Gbit/s] |
|---|---|---|---|---|---|---|
| 6 | 34 | HDP 3.1.4 | 700 | 1360 (680) | 6.8 | 100 |
### SeQuiLa, ADAM and GATK (Spark) parameters

| Parameter | Value | SeQuiLa only | Local-test | Cluster-test |
|---|---|---|---|---|
| spark.biodatageeks.pileup.useVectorizedOrcWriter | true | yes | yes | no |
| spark.sql.orc.compression.codec | snappy | no | yes | yes |
| spark.biodatageeks.readAligment.method | hadoopBAM | yes | yes | no |
| spark.biodatageeks.readAligment.method | disq | yes | no | yes |
| spark.serializer | org.apache.spark.serializer.KryoSerializer | yes | yes | yes |
| spark.kryo.registrator | org.biodatageeks.sequila.pileup.serializers.CustomKryoRegistrator | yes | yes | yes |
| spark.kryoserializer.buffer.max | 1024m | yes | yes | yes |
| spark.hadoop.mapred.min.split.size | 268435456 | yes | no | yes |
| spark.hadoop.mapred.min.split.size | 134217728 | yes | yes | no |
| spark.driver.memory | 8g | no | no | yes |
| spark.driver.memory | 16g | no | yes | no |
| spark.executor.memory | 4g | no | yes | yes |
| spark.dynamicAllocation.enabled | false | no | yes | yes |
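
These settings can be passed on the spark-shell / spark-submit command line with `--conf`, or set programmatically when the session is built. Below is a minimal sketch of a local-test setup in Scala, assuming the SeQuiLa jar is already on the classpath; the values simply mirror the Local-test column above:

```scala
import org.apache.spark.sql.{SequilaSession, SparkSession}

// Minimal sketch of a local-test session; values mirror the "Local-test"
// column of the table above. Assumes the SeQuiLa jar is on the classpath.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.biodatageeks.pileup.useVectorizedOrcWriter", "true")
  .config("spark.sql.orc.compression.codec", "snappy")
  .config("spark.biodatageeks.readAligment.method", "hadoopBAM")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrator", "org.biodatageeks.sequila.pileup.serializers.CustomKryoRegistrator")
  .config("spark.kryoserializer.buffer.max", "1024m")
  .config("spark.hadoop.mapred.min.split.size", "134217728")
  // driver memory must be set before the JVM starts
  // (e.g. spark-shell --driver-memory 16g), so it is omitted here
  .config("spark.dynamicAllocation.enabled", "false")
  .getOrCreate()

val ss = SequilaSession(spark)
```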
### Other parameters

| Software | Operation | Command line options | Multithreading options |
|---|---|---|---|
| ADAM | coverage | -- coverage -collapse | --spark-master local[$n] |
| GATK-Spark | pileup | PileupSpark | --spark-master local[$n] |
| GATK | pileup | Pileup | – |
| megadepth | coverage | --coverage --require-mdz --keep-order | --threads $(n-1) |
| mosdepth | coverage | -x | --threads $(n-1) |
| sambamba | coverage | depth base | --nthreads=$n |
| samtools | coverage | depth | --threads $(n-1) |
| samtools | pileup | mpileup -B -x -A -q 0 -Q 0 | – |
## SeQuiLa

### Coverage

```scala
import org.apache.spark.sql.{SequilaSession, SparkSession}

// Wrap the existing Spark session with SeQuiLa extensions
val ss = SequilaSession(spark)
ss.sparkContext.setLogLevel("INFO")

val bamPath = "/scratch/wiewiom/WGS/NA12878.proper.wgs.md.bam"
val referencePath = "/home/wiewiom/data/Homo_sapiens_assembly18.fasta"

// Compute per-base coverage and persist the result as ORC, timing the job
ss.time {
  ss
    .coverage(bamPath, referencePath)
    .write
    .orc("/tmp/sequila.coverage")
}
```
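
Since the result is written as plain ORC, it can be read back with standard Spark for downstream filtering. A short sketch; the column names (`contig`, `pos_start`, `pos_end`, `coverage`) are an assumption based on SeQuiLa's coverage-blocks output schema:

```scala
import org.apache.spark.sql.functions.col

// Sketch: read back the coverage blocks written above.
// Column names (contig, pos_start, pos_end, coverage) are assumed
// from SeQuiLa's coverage output schema.
val cov = ss.read.orc("/tmp/sequila.coverage")
cov.printSchema()

// Example: high-coverage blocks on chr1
cov
  .where(col("contig") === "chr1" && col("coverage") >= 30)
  .show(10)
```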
### Pileup

```scala
import org.apache.spark.sql.{SequilaSession, SparkSession}

// Wrap the existing Spark session with SeQuiLa extensions
val ss = SequilaSession(spark)
ss.sparkContext.setLogLevel("INFO")

val bamPath = "/scratch/wiewiom/WGS/NA12878.proper.wgs.md.bam"
val referencePath = "/home/wiewiom/data/Homo_sapiens_assembly18.fasta"

// Run pileup (the boolean flag requests base-quality information)
// and persist the result as ORC, timing the job
ss.time {
  ss
    .pileup(bamPath, referencePath, true)
    .write
    .orc("/tmp/sequila.pileup")
}
```
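
The pileup output is likewise an ordinary ORC table that can be registered as a view and queried with Spark SQL. A sketch; only the `contig` column is assumed to exist in the output schema:

```scala
// Sketch: summarize pileup output per contig with Spark SQL.
// Assumes only that the output schema contains a `contig` column.
val pileup = ss.read.orc("/tmp/sequila.pileup")
pileup.createOrReplaceTempView("pileup")

ss.sql("""
  SELECT contig, COUNT(*) AS cnt
  FROM pileup
  GROUP BY contig
  ORDER BY cnt DESC
""").show()
```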
## Results: single node

### Coverage

### Pileup

## Results: Hadoop cluster

### Coverage

### Pileup