SeQuiLa User Guide

SeQuiLa is an ANSI SQL-compliant solution for efficient genomic interval querying and processing, built on top of Apache Spark. Range joins and depth-of-coverage computations are the bread and butter of NGS analysis, but the high volume of data makes them execute very slowly or even fail to complete.

  • SeQuiLa is fast:

    • genome-wide analyses in less than a minute for depth-of-coverage calculations and in several minutes for range joins
    • 22x+ speedup over Spark's default join operation
    • up to 100x+ speedup for interval queries on the BAM data source using indexes (>= 0.4.1)
    • 100% accuracy in functional tests against GRanges and samtools
  • SeQuiLa is elastic:

    • growing catalogue of utility functions and operations, including featureCounts, countOverlaps and bdg_coverage
    • standard SQL DML/DDL such as SPJG (select, predicate, join, group by), CTAS (create table as select) and IAS (insert as select) for easy BAM file manipulation (see the query sketch after this list)
    • exposed parameters for further performance tuning
    • integration with third-party tools through the SparkSQL Thrift JDBC interface
    • can be used natively in R with the sparklyr-sequila package
    • can be used as a command-line tool without any exposure to Scala/Spark/Hadoop
    • Docker images available for a quick start
    • can process data stored on local as well as on distributed file systems (HDFS, S3, Ceph, etc.)
  • SeQuiLa is scalable:

    • implemented in Scala on top of Apache Spark 2.4.x
    • can be run on a single computer (locally) or on a Hadoop cluster using YARN
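
The interval joins and CTAS mentioned above can be driven from plain SQL in Scala. Below is a minimal sketch; the SeQuiLa bootstrap (the org.biodatageeks import, SequilaSession and SequilaRegister) is recalled from the project's examples and may differ between releases, and the reads/targets tables with their columns are purely illustrative assumptions, so check the README of the release you use.

    import org.apache.spark.sql.SparkSession
    // Assumption: exact package and helper names may differ between SeQuiLa releases.
    import org.biodatageeks.utils.{SequilaRegister, SequilaSession}

    object SequilaRangeJoinSketch extends App {

      val spark = SparkSession.builder()
        .appName("sequila-range-join-sketch")
        .master("local[*]") // or submit to a YARN cluster
        .getOrCreate()

      // Wrap the session so SeQuiLa's interval-join strategy and utility functions are registered.
      val ss = SequilaSession(spark)
      SequilaRegister.register(ss)

      // Hypothetical tables: reads (alignments) and targets (intervals).
      ss.sql(
        """SELECT t.targetId, count(*) AS reads_count
          |FROM reads r JOIN targets t
          |  ON r.contigName = t.contigName
          | AND r.end   >= t.start
          | AND r.start <= t.end
          |GROUP BY t.targetId
          |""".stripMargin).show()

      // CTAS: materialize a filtered subset of a BAM-backed table.
      ss.sql("CREATE TABLE chr1_reads AS SELECT * FROM reads WHERE contigName = 'chr1'")
    }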

Availability:

SeQuiLa is publicly available in the following repositories:

Repo               Link
GitHub             https://github.com/ZSI-Bio/bdg-sequila
Maven (release)    https://zsibio.ii.pw.edu.pl/nexus/repository/maven-releases/
sparklyr-sequila   https://github.com/ZSI-Bio/bdg-sparklyr-sequila/
Docker Hub         https://hub.docker.com/r/biodatageeks/bdg-sequila/

Release notes:

0.5.5 (2019-05-22)
  • improved algorithm for long-read support
  • added hadoop-bam support for long reads
  • fixed errors in hadoop-bam caused by unclosed IO streams
  • standardized the showAllPositions parameter, reducing output size
  • bumped Apache Spark version to 2.4.2
0.5.4 (2019-04-13)
  • support for Apache Spark 2.4.1 and the HDP build 2.3.2.3.1.0.0-78
  • support for the disq library for reading BAM and CRAM files in coverage calculations
  • swappable alignment-file read mechanism (the spark.biodatageeks.readAligment.method parameter, defaulting to "hadoopBAM"; see the configuration sketch after this list)
  • initial support for long reads (e.g. Nanopore) using the disq library
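
As a small illustration of the swappable read mechanism, the sketch below switches the reader at the session level. Only the default value "hadoopBAM" is documented here, so the "disq" value is an assumption inferred from the disq-related notes above; verify the accepted values in the documentation.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("sequila-reader-switch")
      .getOrCreate()

    // Default reader is "hadoopBAM"; "disq" is assumed to select the disq-based
    // reader added in 0.5.4.
    spark.conf.set("spark.biodatageeks.readAligment.method", "disq")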

0.5

  • new result type (fixed-length windows) for depth-of-coverage calculations

0.4.1

  • a new, highly optimized distributed implementation of mosdepth for depth-of-coverage calculations
  • BAI indexes support for BAM data source
  • CRAM file format support
  • Intel Genomics Kernel Library (GKL) support for BAM files reading
  • CTAS (Create Table As Select) and IAS (Insert As Select) for BAM Files
  • experimental spark-bam support

0.4

  • completely rewritten R support as a sparklyr extension
  • experimental support for efficient coverage computation for BAMDatasource, exposed as a table-valued function (bdg_coverage; see the sketch at the end of these release notes)
  • sample pruning mechanism for queries accessing only a subset of samples from a table (BAMDatasource)
  • a new JDBC interface based on SequilaThriftServer
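
For illustration, calling the bdg_coverage table-valued function from SQL could look like the line below (reusing the ss session from the earlier sketch). The argument list shown here (a table name, a sample identifier and a result type) and its values are assumptions, so consult the coverage documentation for the exact signature.

    // Assumed invocation of the bdg_coverage table-valued function:
    // table name, sample identifier and result type are illustrative only.
    ss.sql("SELECT * FROM bdg_coverage('reads', 'NA12878', 'blocks')").show()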