SeQuiLa User Guide

SeQuiLa is an ANSI-SQL compliant solution for efficient querying and processing of genomic intervals, built on top of Apache Spark. Range joins and depth of coverage computations are the bread and butter of NGS analysis, but the high volume of data makes them run very slowly or even fail to complete.

  • SeQuiLa is fast:

    • genome-wide analyses in less than a minute (depth of coverage calculations) or in several minutes (range joins)
    • 22x+ speedup over Spark's default join operation
    • up to 100x+ speedup for interval queries on the BAM data source when using indexes (version >= 0.4.1)
    • 100% accuracy in functional tests against GRanges and samtools
  • SeQuiLa is elastic:

    • growing catalogue of utility functions and operations including: featureCounts, countOverlaps and bdg_coverage
    • standard SQL DML/DDL such as SPJG (select, predicate, join, group by), CTAS (create table as select) and IAS (insert as select) for easy manipulation of BAM files
    • exposed parameters for further performance tuning
    • integration with third-party tools through SparkSQL Thrift JDBC driver
    • can be used natively in R with the sparklyr-sequila package
    • can be used as a command-line tool without any exposure to Scala/Spark/Hadoop
    • Docker images available for a quick start
    • can process data stored on both local and distributed file systems (HDFS, S3, Ceph, etc.)
  • SeQuiLa is scalable:

    • implemented in Scala in the Apache Spark 2.2 environment
    • can be run on a single computer (locally) or on a Hadoop cluster using YARN
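
To make the range join and depth of coverage operations mentioned in this guide concrete, here is a minimal plain-Scala sketch of what each one returns. It is a naive, single-machine illustration and not SeQuiLa's implementation: the Interval case class, the IntervalSketch object and the 1-based closed coordinates are assumptions made for this example only. The all-pairs comparison in rangeJoin also hints at why such joins become slow on large datasets.

```scala
// Plain-Scala sketch; all names here are illustrative and not part of the SeQuiLa API.
// Assumption: coordinates are 1-based and closed, i.e. [start, end].
case class Interval(contig: String, start: Int, end: Int)

object IntervalSketch {
  // Two intervals overlap when they share a contig and neither ends before the other starts.
  def overlaps(a: Interval, b: Interval): Boolean =
    a.contig == b.contig && a.start <= b.end && b.start <= a.end

  // Naive range join: compares every pair, so the cost grows as |reads| * |targets|.
  def rangeJoin(reads: Seq[Interval], targets: Seq[Interval]): Seq[(Interval, Interval)] =
    for { r <- reads; t <- targets if overlaps(r, t) } yield (r, t)

  // Depth of coverage: number of reads covering each position touched by at least one read.
  def coverage(reads: Seq[Interval]): Map[(String, Int), Int] =
    reads
      .flatMap(r => (r.start to r.end).map(pos => (r.contig, pos)))
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }
}
```

For example, with reads chr1:100-104, chr1:103-107 and chr2:10-12 and a single target chr1:102-105, rangeJoin returns two pairs and coverage reports a depth of 2 at chr1:103.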

Availability:

SeQuiLa is publicly available in the following repositories:

Repo              Link
GitHub            https://github.com/ZSI-Bio/bdg-sequila
Maven (release)   https://zsibio.ii.pw.edu.pl/nexus/repository/maven-releases/
sparklyr-sequila  https://github.com/ZSI-Bio/bdg-sparklyr-sequila/
Docker Hub        https://hub.docker.com/r/biodatageeks/bdg-sequila/

Using SeQuiLa in your Scala code:

build.sbt

libraryDependencies +=  "org.biodatageeks" % "bdg-sequila_2.11" % "0.5.2"

resolvers +=  "biodatageeks-releases" at "https://zsibio.ii.pw.edu.pl/nexus/repository/maven-releases/"
resolvers +=  "biodatageeks-snapshots" at "https://zsibio.ii.pw.edu.pl/nexus/repository/maven-snapshots/"

Example.scala

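import org.biodatageeks.utils.{SequilaRegister, UDFRegister}
import org.apache.spark.sql.SequilaSession
import org.apache.spark.sql.SparkSession

// Obtain (or reuse) a regular Spark session
val spark = SparkSession.builder()
    .getOrCreate()
// Wrap it in a SeQuiLa session
val ss = new SequilaSession(spark)
// Register SeQuiLa's extensions and user-defined functions with the session
SequilaRegister.register(ss)
UDFRegister.register(ss)
// Run SQL through the SeQuiLa session
ss.sql(...)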
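
The ss.sql(...) call above accepts SQL text. As a rough illustration of the kind of range-join query this guide refers to, the sketch below runs an interval-overlap join on two tiny in-memory tables using only the stock Spark SQL API, which is assumed to be on the classpath. The RangeJoinSqlSketch object, the reads and targets views and their column names are all hypothetical; presumably the same query string could be passed to a registered SequilaSession instead of a plain SparkSession.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch only: view and column names are made up for illustration.
object RangeJoinSqlSketch {
  // Interval-overlap (range) join expressed in standard SQL.
  val query: String =
    """SELECT r.contig, r.pos_start, r.pos_end, t.name
      |FROM reads r JOIN targets t
      |  ON r.contig = t.contig
      | AND r.pos_start <= t.pos_end
      | AND t.pos_start <= r.pos_end""".stripMargin

  // Registers two small temporary views and runs the join on them.
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("chr1", 100, 104), ("chr1", 103, 107), ("chr2", 10, 12))
      .toDF("contig", "pos_start", "pos_end")
      .createOrReplaceTempView("reads")
    Seq(("chr1", 102, 105, "exon1"))
      .toDF("contig", "pos_start", "pos_end", "name")
      .createOrReplaceTempView("targets")
    spark.sql(query)
  }
}
```

Calling RangeJoinSqlSketch.run(spark).show() on a local session should list the two chr1 reads, each paired with the exon1 target.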

Release notes:

0.5

  • new result type (fixed-length windows) for depth of coverage calculations

0.4.1

  • a new highly-optimized mosdepth distributed implementation for depth of coverage calculations
  • BAI indexes support for BAM data source
  • CRAM file format support
  • Intel Genomics Kernel Library (GKL) support for reading BAM files
  • CTAS (Create Table As Select) and IAS (Insert As Select) for BAM files
  • experimental spark-bam support

0.4

  • completely rewritten R support as a sparklyr extension
  • experimental support for efficient coverage computation for BAMDatasource exposed as a table-valued function (bdg_coverage)
  • sample pruning mechanism for queries accessing only a subset of samples from a table (BAMDatasource)
  • a new JDBC interface based on SequilaThriftServer