Interval joins
Performance tests description
To evaluate our range join strategy, we ran a series of tests on both a single-node installation and a Hadoop cluster. This allowed us to analyze both vertical scalability (adding computing resources such as CPU/RAM to one machine) and horizontal scalability (adding resources across multiple machines).
The main idea of the tests was to compare the performance of SeQuiLa with other tools, such as featureCounts and GenAp, that can be used to compute the number of reads intersecting predefined genomic intervals. This is one of the most commonly used operations in both DNA-seq and RNA-seq data processing, most notably in differential gene expression analysis and copy number variation calling. The featureCounts performance results were treated as the baseline. To show the difference between the naive approach, which uses the default range join strategy available in Spark SQL, and SeQuiLa's interval-tree-based strategy, we included the naive approach in the single-node test.
Testing environment
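The gap between a nested-loop comparison and an index-based lookup can be illustrated with a small, Spark-free sketch. It is not SeQuiLa's or Spark SQL's implementation: the `Interval` class, the function names, and the toy data are hypothetical, and the indexed variant uses sorted endpoints with binary search as a simple stand-in for an interval tree. It assumes closed intervals with `start <= end`.

```scala
object IntervalCountSketch {
  // Hypothetical toy model: a closed interval [start, end] on a named contig.
  final case class Interval(contig: String, start: Int, end: Int)

  // Closed-interval overlap test on the same contig.
  def overlaps(a: Interval, b: Interval): Boolean =
    a.contig == b.contig && a.end >= b.start && a.start <= b.end

  // Naive strategy: compare every target with every read, O(reads * targets).
  def countNaive(reads: Seq[Interval], targets: Seq[Interval]): Map[Interval, Int] =
    targets.map(t => t -> reads.count(r => overlaps(r, t))).toMap

  // Number of elements in the sorted array `xs` that are <= key (binary search).
  def countLE(xs: Array[Int], key: Int): Int = {
    var lo = 0
    var hi = xs.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (xs(mid) <= key) lo = mid + 1 else hi = mid
    }
    lo
  }

  // Indexed strategy (a stand-in for an interval tree, not SeQuiLa's code):
  // overlapping reads = reads starting at or before t.end
  //                     minus reads ending before t.start.
  // Assumes start <= end for every read.
  def countIndexed(reads: Seq[Interval], targets: Seq[Interval]): Map[Interval, Int] = {
    val byContig: Map[String, (Array[Int], Array[Int])] =
      reads.groupBy(_.contig).map { case (contig, rs) =>
        contig -> ((rs.map(_.start).sorted.toArray, rs.map(_.end).sorted.toArray))
      }
    targets.map { t =>
      val n = byContig.get(t.contig) match {
        case Some((starts, ends)) => countLE(starts, t.end) - countLE(ends, t.start - 1)
        case None                 => 0 // no reads on this contig
      }
      t -> n
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val reads = Seq(
      Interval("chr1", 34, 109),
      Interval("chr1", 100, 150),
      Interval("chr1", 4800, 4810),
      Interval("chr2", 10, 20)
    )
    val targets = Seq(
      Interval("chr1", 4806, 4926),
      Interval("chr1", 109, 120),
      Interval("chr3", 1, 5)
    )
    // Both strategies should report the same counts per target.
    println(countNaive(reads, targets))
    println(countIndexed(reads, targets))
  }
}
```

Once the reads are sorted, each target costs two binary searches instead of a scan over all reads. An interval tree offers a comparable per-lookup saving and can also return the matching rows rather than only their count.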
Datasets
Test name | Format | Size [GB] | Row count |
---|---|---|---|
WES-SN | BAM | 17 | 161544693 |
WES-SN | ADAM | 14 | 161544693 |
WES-SN | BED | 0.0045 | 193557 |
WGS-CL | BAM | 273 | 2617420313 |
WGS-CL | ADAM | 215 | 2617420313 |
WGS-CL | BED | 0.0016 | 68191 |
WES-SN - tests performed on a single node using WES dataset
WGS-CL - tests performed on a cluster using WGS dataset
Hadoop cluster
Masters | Workers | Hadoop distribution | Total disk HDFS [TB] | Total YARN cores | Total YARN RAM [TB] | Net [Gbits] |
---|---|---|---|---|---|---|
1 | 4 | CDH 5.12 | 5 | 96 | 256 | 10 |
Apache Spark configuration
Parameter | Value | SeQuiLa only | Local-test | Cluster-test |
---|---|---|---|---|
spark.driver.memory | 8g | no | yes | yes |
spark.executor.memory | 4g-8g | no | yes | yes |
spark.executor.cores | 2-4 | no | yes | yes |
spark.dynamicAllocation.enabled | false | no | yes | yes |
Software
Software | Version |
---|---|
Apache Spark | 2.2.1 |
Scala | 2.11.8 |
JDK | 1.8 |
ADAM | 0.22 |
featureCounts | 1.6.0 |
SeQuiLa
BAM table
val bamPath = "NA12878*.bam"
spark.sql(
s"""
|CREATE TABLE reads
|USING org.biodatageeks.sequila.datasources.BAM.BAMDataSource
|OPTIONS(path "${bamPath}")
|
""".stripMargin)
spark.sql(s"SELECT contigName,start,end FROM reads LIMIT 1").show()
+----------+-----+---+
|contigName|start|end|
+----------+-----+---+
| chr1| 34|109|
+----------+-----+---+
ADAM table
val adamPath = "NA12878*.adam"
spark.sql(
s"""
|CREATE TABLE reads
|USING org.biodatageeks.sequila.datasources.ADAM.ADAMDataSource
|OPTIONS(path "${adamPath}")
|
""".stripMargin)
spark.sql(s"SELECT contigName,start,end FROM reads LIMIT 1").show()
+----------+-----+---+
|contigName|start|end|
+----------+-----+---+
| chr1| 34|109|
+----------+-----+---+
BED table
val bedPath="tgp_exome_hg18.bed"
spark.sql(s"""
|CREATE TABLE targets
|USING org.biodatageeks.sequila.datasources.BED.BEDDataSource
|OPTIONS (path "file:///${bedPath}")""".stripMargin)
spark.sql("SELECT contig,pos_start,pos_end FROM targets LIMIT 1").show
+------+---------+--------+
|contig|pos_start| pos_end|
+------+---------+--------+
| chr1| 4806| 4926|
+------+---------+--------+
SELECT targets.contig,targets.pos_start,targets.pos_end,count(*) FROM reads JOIN targets
ON (targets.contig=reads.contigName
AND
CAST(reads.end AS INTEGER)>=CAST(targets.pos_start AS INTEGER)
AND
CAST(reads.start AS INTEGER)<=CAST(targets.pos_end AS INTEGER)
)
GROUP BY targets.contig,targets.pos_start,targets.pos_end
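The semantics of the join condition above can be checked on toy data without Spark. The following is a minimal sketch only: the `Region` class, the function names, and the coordinates are hypothetical and are not part of SeQuiLa. It mirrors the `ON` clause as a closed-interval overlap test, then groups and counts like the inner join with `GROUP BY`.

```scala
object RangeJoinPredicateSketch {
  // Hypothetical stand-in for a row of either table: contig plus closed [start, end].
  final case class Region(contig: String, start: Int, end: Int)

  // Same three conditions as the ON clause of the query:
  // equal contig, read.end >= target.start, read.start <= target.end.
  def joinCondition(read: Region, target: Region): Boolean =
    target.contig == read.contig &&
      read.end >= target.start &&
      read.start <= target.end

  // Inner join followed by GROUP BY target and count(*).
  // As with an inner join, targets without any matching read do not appear.
  def joinAndCount(reads: Seq[Region], targets: Seq[Region]): Map[Region, Int] = {
    val matchedTargets = for {
      t <- targets
      r <- reads
      if joinCondition(r, t)
    } yield t
    matchedTargets.groupBy(identity).map { case (t, hits) => t -> hits.size }
  }

  def main(args: Array[String]): Unit = {
    val targets = Seq(Region("chr1", 4806, 4926), Region("chr3", 1, 5))
    val reads = Seq(
      Region("chr1", 4700, 4806), // ends exactly at target start: counted
      Region("chr1", 4926, 5000), // starts exactly at target end: counted
      Region("chr1", 4927, 5000), // starts one base after target end: not counted
      Region("chr2", 4806, 4926)  // same coordinates, different contig: not counted
    )
    joinAndCount(reads, targets).foreach(println)
  }
}
```

Because both comparisons are inclusive, a read that shares only a single boundary position with a target is still counted.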
Results single node
Results Hadoop cluster
Conclusions
SeQuiLa, when run in parallel, outperforms the selected competing tools in terms of speed on a single node (1.7-22.1x) and on a cluster (3.2-4.7x). The SeQuiLa strategy that broadcasts the interval forest with all data columns (SeQuiLa_it_all) performs best in most cases, since no network shuffling is required, whereas broadcasting intervals with identifiers only (SeQuiLa_it_int) performs comparably to, or better than, GenAp. All algorithms favour the columnar format (ADAM) over the row-oriented one (BAM) thanks to column pruning and reduced disk I/O.