Big data for genomics

NGS analysis on big data platform

Big Data platform for genomics

Background:

We are working on delivering a coherent and extensible research platform for convenient and fast multi-sample analyses and for statistical and machine learning experiments, including but not limited to: calculating variant frequencies across subpopulations, calculating cumulative allele frequencies in genes/exons, providing input for association studies, filtering variants based on multiple criteria, selecting variants that co-segregate with disease in a family (Mendelian filters), and logistic regression, classification and clustering experiments.
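As a rough illustration of the first analysis type listed above, the sketch below computes alternate-allele frequencies per subpopulation and then filters variants on a frequency threshold. It is plain Python rather than platform code; the record layout, the `POP_A`/`POP_B` labels, the toy genotypes and both function names are assumptions made only for this example.

```python
from collections import defaultdict

# Hypothetical toy records: (sample_id, subpopulation, variant_id, alt_allele_count).
# alt_allele_count is 0, 1 or 2 for a diploid genotype.
GENOTYPES = [
    ("s1", "POP_A", "chr1:1000:A>G", 1),
    ("s2", "POP_A", "chr1:1000:A>G", 2),
    ("s3", "POP_B", "chr1:1000:A>G", 0),
    ("s4", "POP_B", "chr1:1000:A>G", 1),
    ("s1", "POP_A", "chr2:500:C>T", 0),
    ("s2", "POP_A", "chr2:500:C>T", 0),
    ("s3", "POP_B", "chr2:500:C>T", 2),
    ("s4", "POP_B", "chr2:500:C>T", 2),
]


def allele_frequencies(genotypes, ploidy=2):
    """Return {(variant, subpopulation): alternate-allele frequency}."""
    alt = defaultdict(int)    # alternate alleles observed per (variant, subpopulation)
    total = defaultdict(int)  # all alleles observed per (variant, subpopulation)
    for _sample, pop, variant, alt_count in genotypes:
        key = (variant, pop)
        alt[key] += alt_count
        total[key] += ploidy
    return {key: alt[key] / total[key] for key in total}


def filter_by_max_frequency(freqs, max_af):
    """Keep variants whose frequency is <= max_af in every subpopulation."""
    per_variant = defaultdict(list)
    for (variant, _pop), af in freqs.items():
        per_variant[variant].append(af)
    return sorted(v for v, afs in per_variant.items() if max(afs) <= max_af)


if __name__ == "__main__":
    freqs = allele_frequencies(GENOTYPES)
    for (variant, pop), af in sorted(freqs.items()):
        print(variant, pop, af)
    print(filter_by_max_frequency(freqs, max_af=0.8))
```

On a distributed engine the same computation would typically be a group-by aggregation over a genotype table. The dictionary version above is only meant to make the arithmetic explicit.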

Methods:

Since the primary goal of this project is a reliable, scalable and high-performance solution, the proposed architecture is based on industry-standard open-source components from the Hadoop and Kubernetes ecosystems. The solution can be divided into loosely coupled functional modules, which can be deployed and scaled out separately. For efficient data processing we use the Apache Spark execution engine, specifically its Spark SQL module with the extensible Catalyst optimizer. The SeQuiLa module, integrated with Catalyst, will ensure fast execution of typical bioinformatics operations (such as range joins and depth-of-coverage calculations). Depending on the detailed data model, we may choose additional engines.
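To make the two operations named above concrete, the following single-machine Python sketch shows what a range join and a depth-of-coverage calculation compute. It does not use Spark or SeQuiLa and is not their implementation; the tuple layout `(contig, start, end)` with 1-based closed coordinates, the function names and the sorted-starts lookup are all assumptions chosen for brevity.

```python
from bisect import bisect_right
from collections import defaultdict


def range_join(reads, targets):
    """Return (read, target) pairs on the same contig whose closed intervals overlap."""
    by_contig = defaultdict(list)
    for target in targets:
        by_contig[target[0]].append(target)
    starts = {}
    for contig, items in by_contig.items():
        items.sort(key=lambda t: t[1])          # sort targets by start position
        starts[contig] = [t[1] for t in items]

    pairs = []
    for read in reads:
        contig, r_start, r_end = read
        # Only targets that start at or before the read's end can overlap it.
        hi = bisect_right(starts.get(contig, []), r_end)
        for target in by_contig.get(contig, [])[:hi]:
            if target[2] >= r_start:            # target must end at or after the read's start
                pairs.append((read, target))
    return pairs


def depth_of_coverage(reads):
    """Return {contig: [(start, end, depth), ...]} for covered blocks of constant depth."""
    events = defaultdict(lambda: defaultdict(int))
    for contig, start, end in reads:
        events[contig][start] += 1              # coverage rises at the read's start
        events[contig][end + 1] -= 1            # and falls just past its end

    result = {}
    for contig, deltas in events.items():
        blocks, depth, prev = [], 0, None
        for pos in sorted(deltas):
            if deltas[pos] == 0:                # a start and an end cancel out here
                continue
            if depth > 0:
                blocks.append((prev, pos - 1, depth))
            depth += deltas[pos]
            prev = pos
        result[contig] = blocks
    return result


if __name__ == "__main__":
    reads = [("chr1", 100, 200), ("chr1", 150, 250), ("chr2", 10, 20)]
    targets = [("chr1", 180, 300), ("chr1", 260, 400), ("chr2", 21, 30)]
    print(range_join(reads, targets))
    print(depth_of_coverage(reads))
```

The range join here still scans every target that starts before the read ends, so it is a sketch of the semantics rather than of an optimized strategy. A production engine would normally use an interval index and distribute the work across executors.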

Results:

Our preliminary results indicate that the proposed solution can substantially improve on state-of-the-art methods run with their default settings, in both multi-sample and single-sample analyses. We are currently focused on achieving high query performance using an intelligent federation layer on top of Spark's optimizer.

  • Status: In progress
  • Started: April 2019
  • Lead: Agnieszka Szmurło / Marek Wiewiórka