Distributed pipelines

Streamlining NGS analyses with distributed pipelines

Big Data platform for genomics

Background:

Analyses of ever-growing NGS data sets are computationally intensive and require substantial amounts of both time and storage. In this work we focus on delivering a framework for a fast, distributed single nucleotide variant (SNV) calling pipeline that uses a big data technology stack.

Methods:

We are working on a framework for defining and running bioinformatics pipelines encompassing secondary (read alignment, variant calling) and tertiary (variant decomposition and annotation) analysis. Owing to the distributed computation model, computations on data fragments are executed in parallel on independent computational nodes, which leads to a significant performance gain. At the core of our solution, we build upon current state-of-the-art tools wrapped in Docker containers, while still allowing the user to set all of the tools' custom options. Additionally, we provide a means of authoring a pipeline by specifying its steps in a declarative way. Finally, the results of computations can be stored traditionally as text files as well as in modern binary formats optimized for distributed access.
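The idea of a declaratively specified pipeline applied to data fragments in parallel can be illustrated with a minimal Python sketch. It is not the framework's actual interface: the step names, the `tool` and `options` fields, and the functions `run_step`, `run_pipeline` and `run_distributed` are hypothetical, each step is a placeholder instead of a call to a containerized tool, and a local thread pool stands in for independent computational nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical declarative pipeline: an ordered list of steps, each naming
# a tool and its user-supplied options. All names are illustrative only.
PIPELINE = [
    {"name": "align", "tool": "aligner", "options": {"threads": 1}},
    {"name": "call_variants", "tool": "caller", "options": {"min_quality": 20}},
    {"name": "annotate", "tool": "annotator", "options": {}},
]


def run_step(fragment, step):
    # Placeholder for invoking a containerized tool on one data fragment;
    # here we only record which step was applied.
    return fragment + [step["name"]]


def run_pipeline(fragment, steps=PIPELINE):
    # Apply the steps in their declared order to a single fragment.
    return reduce(run_step, steps, fragment)


def run_distributed(fragments, steps=PIPELINE, workers=4):
    # Process fragments independently and in parallel. A thread pool is an
    # assumption standing in for separate cluster nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: run_pipeline(f, steps), fragments))


fragments = [["chr1"], ["chr2"], ["chr3"]]
results = run_distributed(fragments)
```

Because each fragment passes through the same ordered steps without depending on any other fragment, the work can be spread across as many workers as there are fragments, which is the source of the performance gain described above.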

Results:

We foresee that the proposed approach to fast, high-performance pipelines may lead to wider adoption of NGS analysis of sequencing data on commodity hardware. Short processing times and data stored on a distributed file system may promote further downstream analyses using scalable algorithms.

  • Status: In progress
  • Started: October 2018
  • Lead: Marek Wiewiórka & Agnieszka Szmurło