Interval joins

SeQuiLa’s interval joins benchmark

Performance tests description

To evaluate our range join strategy, we ran a number of tests on both a single-node installation and a Hadoop cluster. This allowed us to analyze both vertical scalability (adding computing resources such as CPU and RAM to one machine) and horizontal scalability (adding resources across multiple machines).

The main idea of the tests was to compare the performance of SeQuiLa with other tools, such as featureCounts and GenAp, that can be used to compute the number of reads intersecting predefined genomic intervals. This is one of the most commonly used operations in both DNA-seq and RNA-seq data processing, most notably in differential gene expression analysis and copy number variation calling. The featureCounts results were treated as the baseline. To show the difference between the naive approach (the default range join strategy available in Spark SQL) and SeQuiLa's interval-tree-based strategy, we also included the naive approach in the single-node test.

Testing environment

Datasets

Test name	Format	Size [GB]	Row count
WES-SN	BAM	17	161544693
WES-SN	ADAM	14	161544693
WES-SN	BED	0.0045	193557
WGS-CL	BAM	273	2617420313
WGS-CL	ADAM	215	2617420313
WGS-CL	BED	0.0016	68191

WES-SN - tests performed on a single node using WES dataset

WGS-CL - tests performed on a cluster using WGS dataset

Hadoop cluster

Masters	Workers	Hadoop distribution	Total HDFS disk [TB]	Total YARN cores	Total YARN RAM [GB]	Network [Gbit/s]
1	4	CDH 5.12	5	96	256	10

Apache Spark configuration

Parameter	Value	SeQuiLa only	Local-test	Cluster-test
spark.driver.memory	8g	no	yes	yes
spark.executor.memory	4g-8g	no	yes	yes
spark.executor.cores	2-4	no	yes	yes
spark.dynamicAllocation.enabled	false	no	yes	yes

Software

Software	Version
Apache Spark	2.2.1
Scala	2.11.8
JDK	1.8
ADAM	0.22
featureCounts	1.6.0

SeQuiLa

BAM table

    val bamPath = "NA12878*.bam"
    spark.sql(
      s"""
         |CREATE TABLE reads
         |USING org.biodatageeks.sequila.datasources.BAM.BAMDataSource
         |OPTIONS(path "${bamPath}")
         |
      """.stripMargin)
    spark.sql(s"SELECT contigName,start,end FROM reads LIMIT 1").show()

    +----------+-----+---+
    |contigName|start|end|
    +----------+-----+---+
    |      chr1|   34|109|
    +----------+-----+---+

ADAM table

    val adamPath = "NA12878*.adam"
    spark.sql(
      s"""
         |CREATE TABLE reads
         |USING org.biodatageeks.sequila.datasources.ADAM.ADAMDataSource
         |OPTIONS(path "${adamPath}")
         |
      """.stripMargin)
    spark.sql(s"SELECT contigName,start,end FROM reads LIMIT 1").show()

    +----------+-----+---+
    |contigName|start|end|
    +----------+-----+---+
    |      chr1|   34|109|
    +----------+-----+---+

BED table

    val bedPath = "tgp_exome_hg18.bed"
    spark.sql(s"""
        |CREATE TABLE targets
        |USING org.biodatageeks.sequila.datasources.BED.BEDDataSource
        |OPTIONS (path "file:///${bedPath}")""".stripMargin)
    spark.sql("SELECT contig, pos_start, pos_end FROM targets LIMIT 1").show()

    +------+---------+--------+
    |contig|pos_start| pos_end|
    +------+---------+--------+
    |  chr1|     4806|    4926|
    +------+---------+--------+

Interval join query

    SELECT targets.contig, targets.pos_start, targets.pos_end, count(*)
         FROM reads JOIN targets
         ON (targets.contig = reads.contigName
         AND
         CAST(reads.end AS INTEGER) >= CAST(targets.pos_start AS INTEGER)
         AND
         CAST(reads.start AS INTEGER) <= CAST(targets.pos_end AS INTEGER)
         )
         GROUP BY targets.contig, targets.pos_start, targets.pos_end
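
The semantics of the join query can be sketched in plain Scala, without Spark, as a nested-loop comparison of every read against every target. This is only a reference sketch of what the query computes, not how Spark SQL or SeQuiLa executes it. `GenomicInterval` and `NaiveIntervalCount` are hypothetical names introduced for illustration, and coordinates are assumed to be closed intervals, as the `>=` and `<=` conditions suggest.

```scala
// Hypothetical type for illustration: one read or one target region.
final case class GenomicInterval(contig: String, start: Int, end: Int)

object NaiveIntervalCount {
  // Two closed intervals overlap when they lie on the same contig and
  // each one starts no later than the other one ends.
  def overlaps(read: GenomicInterval, target: GenomicInterval): Boolean =
    read.contig == target.contig &&
      read.end >= target.start &&
      read.start <= target.end

  // Nested-loop equivalent of the inner join followed by GROUP BY and count(*):
  // emit the target once per overlapping read, then count per target.
  // Targets without any overlapping read do not appear in the result.
  def countReads(
      reads: Seq[GenomicInterval],
      targets: Seq[GenomicInterval]): Map[GenomicInterval, Int] = {
    val matchedTargets = for {
      target <- targets
      read <- reads
      if overlaps(read, target)
    } yield target
    matchedTargets.groupBy(identity).map { case (target, hits) => target -> hits.size }
  }
}
```

For example, with reads `chr1:34-109` and `chr1:4900-4975` and a single target `chr1:4806-4926`, `countReads` returns a map with one entry: the target with a count of 1. The cost of this approach grows with the product of the two table sizes, which is the motivation for an interval-tree-based strategy.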

Results single node

Results Hadoop cluster

Conclusions

When run in parallel, SeQuiLa outperforms the selected competing tools in terms of speed, both on a single node (1.7-22.1x) and on a cluster (3.2-4.7x). The SeQuiLa strategy that broadcasts the interval forest with all data columns (SeQuiLa_it_all) performs best in most cases because no network shuffling is required, whereas broadcasting intervals with identifiers only (SeQuiLa_it_int) performs comparably to, or better than, GenAp. All algorithms favour the columnar format (ADAM) over the row-oriented one (BAM) thanks to column pruning and reduced disk I/O.
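
The idea behind an interval forest can be illustrated with a minimal plain-Scala sketch: build one small interval index per contig from the targets, then stream the reads past it and count the hits. This is not SeQuiLa's actual implementation. `Region`, `IntervalIndex` and `IntervalForestCount` are hypothetical names, and the index shown here is a simple static tree laid over a sorted array, chosen for brevity.

```scala
// Hypothetical type for illustration: one read or one target region (closed interval).
final case class Region(contig: String, start: Int, end: Int)

// Static interval tree for the regions of a single contig: a balanced binary
// search tree laid over an array sorted by start, where every node also
// remembers the largest end found in its subtree.
final class IntervalIndex(regions: Seq[Region]) {
  private val sorted: Array[Region] = regions.sortBy(_.start).toArray
  private val maxEnd: Array[Int] = new Array[Int](sorted.length)
  build(0, sorted.length - 1)

  // The node for the slice [lo, hi] is its midpoint; returns the subtree's max end.
  private def build(lo: Int, hi: Int): Int =
    if (lo > hi) Int.MinValue
    else {
      val mid = (lo + hi) >>> 1
      val m = math.max(sorted(mid).end, math.max(build(lo, mid - 1), build(mid + 1, hi)))
      maxEnd(mid) = m
      m
    }

  // All indexed regions that overlap the closed interval [start, end].
  def overlapping(start: Int, end: Int): List[Region] = {
    def go(lo: Int, hi: Int, acc: List[Region]): List[Region] =
      if (lo > hi) acc
      else {
        val mid = (lo + hi) >>> 1
        if (maxEnd(mid) < start) acc // nothing in this subtree reaches the query
        else {
          val left = go(lo, mid - 1, acc)
          if (sorted(mid).start > end) left // this node and its right subtree start too late
          else {
            val here = if (sorted(mid).end >= start) sorted(mid) :: left else left
            go(mid + 1, hi, here)
          }
        }
      }
    go(0, sorted.length - 1, Nil)
  }
}

object IntervalForestCount {
  // One index per contig, built from the small targets table.
  def buildForest(targets: Seq[Region]): Map[String, IntervalIndex] =
    targets.groupBy(_.contig).map { case (contig, rs) => contig -> new IntervalIndex(rs) }

  // Stream the reads once and count, per target, how many reads overlap it.
  def countReads(reads: Iterator[Region], forest: Map[String, IntervalIndex]): Map[Region, Int] = {
    val counts = scala.collection.mutable.Map.empty[Region, Int].withDefaultValue(0)
    for {
      read <- reads
      index <- forest.get(read.contig)
      target <- index.overlapping(read.start, read.end)
    } counts(target) += 1
    counts.toMap
  }
}
```

In a Spark job, a forest built from the small targets table is the kind of structure that could be shipped to executors with `SparkContext.broadcast`, so that each partition of reads is counted locally and the large reads table does not have to be shuffled.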

Last modified July 26, 2024: Fix comet condition (#180) (15431c5)