March 21, 2019 / by Agnieszka Szmurło
Over the past several months we have been working on the SeQuiLa solution (we like both tequila & SQL ;) which, in general, applies big data technologies to computationally expensive genomic problems.
In the first part we tackled distributed range joins with broadcastable interval trees injected into Apache Spark’s optimizer. That’s how the first SeQuiLa package was created.
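To give a feel for the idea, here is a minimal single-machine sketch in plain Python, not the SeQuiLa code itself. It builds an interval tree over the smaller table and probes it once per row of the larger table, which is roughly what each Spark task could do with a broadcast copy of the tree. The names `Node`, `build`, `query` and `range_join`, the closed-interval convention and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple

# (start, end, label) with closed coordinates; a convention assumed for this sketch
Interval = Tuple[int, int, str]


@dataclass
class Node:
    interval: Interval
    max_end: int  # largest end coordinate anywhere in this subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(sorted_ivs: List[Interval]) -> Optional[Node]:
    """Build a balanced tree from intervals already sorted by start."""
    if not sorted_ivs:
        return None
    mid = len(sorted_ivs) // 2
    node = Node(sorted_ivs[mid], sorted_ivs[mid][1])
    node.left = build(sorted_ivs[:mid])
    node.right = build(sorted_ivs[mid + 1:])
    for child in (node.left, node.right):
        if child is not None:
            node.max_end = max(node.max_end, child.max_end)
    return node


def query(node: Optional[Node], qs: int, qe: int, out: List[Interval]) -> None:
    """Append every stored interval overlapping [qs, qe] to out."""
    if node is None or node.max_end < qs:
        return  # nothing in this subtree reaches the query
    query(node.left, qs, qe, out)
    start, end, _ = node.interval
    if start <= qe and qs <= end:
        out.append(node.interval)
    if start <= qe:  # intervals to the right start even later
        query(node.right, qs, qe, out)


def range_join(big: List[Interval], tree: Optional[Node]) -> Iterator[Tuple[str, str]]:
    """Probe the tree once per row of the larger table."""
    for qs, qe, label in big:
        hits: List[Interval] = []
        query(tree, qs, qe, hits)
        for _, _, hit_label in hits:
            yield (label, hit_label)


# Hypothetical sample data: a small table of genes and a larger table of reads
genes = [(100, 200, "geneA"), (150, 300, "geneB"), (400, 500, "geneC")]
reads = [(180, 220, "r1"), (350, 390, "r2"), (500, 510, "r3")]

tree = build(sorted(genes))  # in Spark, this is the object one would broadcast
pairs = list(range_join(reads, tree))
print(pairs)
```

Because the tree is built once from the smaller table and only read afterwards, shipping it to every worker avoids shuffling the larger table.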
Secondly, we implemented an event-based method of coverage calculation in a distributed manner, using accumulators & broadcast variables to reduce network shuffles.
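The event-based idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the SeQuiLa implementation: every read adds +1 where it starts and -1 just past where it ends, and a running sum over those events gives the depth at each position. The helper names (`events_for`, `merge_events`, `coverage`), the 0-based inclusive coordinates and the two "partitions" are assumptions made for the example.

```python
from itertools import accumulate
from typing import List, Tuple

# (start, end), 0-based with inclusive end; a convention assumed for this sketch
Read = Tuple[int, int]


def events_for(reads: List[Read], contig_len: int) -> List[int]:
    """Record +1 at each read start and -1 one position past each read end."""
    events = [0] * (contig_len + 1)  # extra slot for reads ending at the last base
    for start, end in reads:
        events[start] += 1
        events[end + 1] -= 1
    return events


def merge_events(a: List[int], b: List[int]) -> List[int]:
    """Event vectors from different partitions combine by element-wise addition."""
    return [x + y for x, y in zip(a, b)]


def coverage(events: List[int]) -> List[int]:
    """A running sum over the events yields the per-base depth."""
    return list(accumulate(events))[:-1]


# Hypothetical reads split across two partitions of a 10-base contig
partition_1 = [(0, 3), (2, 5)]
partition_2 = [(4, 9)]

merged = merge_events(events_for(partition_1, 10), events_for(partition_2, 10))
depth = coverage(merged)
print(depth)
```

Since the event vectors merge by simple addition, partial results from different partitions can be combined without moving the reads themselves around, which is the property that makes accumulators and broadcast variables a natural fit.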
Have a look at the documentation site: docs
Ah… The range-joins part is already published in the peer-reviewed journal Bioinformatics: bioinf-seq
The coverage part is under review in GigaScience, but you can already read it on bioRxiv: biorxiv-cov
More of our work is coming. We are still working on bringing SQL access and distributed processing to genomic data.