SeQuiLa: an elastic, fast and scalable SQL-oriented solution for processing and querying genomic intervals

Abstract

Efficient processing of large-scale genomic datasets has recently become possible due to the application of ‘big data’ technologies in bioinformatics pipelines. We present SeQuiLa, a distributed, ANSI SQL-compliant solution for fast querying and processing of genomic intervals that is available as an Apache Spark package. The proposed range join strategy is significantly (∼22×) faster than the default Apache Spark implementation and outperforms other state-of-the-art tools for genomic interval processing.

Publication
Bioinformatics, Volume 35, Issue 12, June 2019, Pages 2156–2158

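The range join mentioned in the abstract pairs every interval in one set with all overlapping intervals in another. The sketch below illustrates the general idea in plain Python and is not SeQuiLa's implementation or API: the `Interval` layout, the closed-coordinate convention, and both function names are assumptions made for illustration. It contrasts a naive nested-loop join with a variant that sorts one side by start position so that a binary search can skip candidates that begin after the query interval ends.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical record layout: (chromosome, start, end), closed coordinates.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Range-join predicate: same chromosome and intersecting closed ranges."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def naive_range_join(left: List[Interval], right: List[Interval]):
    """Compare every pair; cost grows with len(left) * len(right)."""
    return [(a, b) for a in left for b in right if overlaps(a, b)]


def sorted_range_join(left: List[Interval], right: List[Interval]):
    """Group `right` by chromosome and sort by start, then use binary
    search to discard intervals that start after the query interval ends."""
    by_chrom: Dict[str, List[Interval]] = {}
    for iv in right:
        by_chrom.setdefault(iv[0], []).append(iv)
    for ivs in by_chrom.values():
        ivs.sort(key=lambda iv: iv[1])
    starts = {chrom: [iv[1] for iv in ivs] for chrom, ivs in by_chrom.items()}

    result = []
    for a in left:
        candidates = by_chrom.get(a[0], [])
        # Only intervals with start <= a.end can overlap a.
        hi = bisect_right(starts.get(a[0], []), a[2])
        for b in candidates[:hi]:
            if b[2] >= a[1]:  # remaining half of the overlap test
                result.append((a, b))
    return result


if __name__ == "__main__":
    reads = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
    targets = [("chr1", 150, 300), ("chr1", 190, 510), ("chr2", 81, 90)]
    for pair in sorted_range_join(reads, targets):
        print(pair)
```

The sorted variant still scans every candidate that starts before the query ends, so it only prunes one side of the predicate; an interval tree or similar index would be needed to bound both sides.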