March 21, 2019 / by Agnieszka Szmurło

Short update on our research

Over the past several months we have been working on the SeQuiLa solution (we like both tequila & SQL ;) which, in general, applies big data technologies to computationally expensive genomic problems.

First, we tackled distributed range joins with broadcastable interval trees injected into Apache Spark’s optimizer. That’s how the first SeQuiLa package was created.
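
To give a feel for the idea, here is a minimal sketch in plain Python, not the actual SeQuiLa code. It assumes the smaller table (hypothetical `genes`) fits in memory, so it can be indexed once and probed for every row of the larger table (hypothetical `reads`). In Spark the index would be shipped to the executors as a broadcast variable. The names `IntervalTree` and `range_join` are made up for this illustration.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int, str]  # (start, end, label), both ends inclusive


class IntervalTree:
    """Implicit balanced tree over intervals sorted by start.

    Each node remembers the largest end in its subtree, which lets a
    query skip subtrees that cannot overlap.
    """

    def __init__(self, intervals: Iterable[Interval]) -> None:
        self._nodes: List[Interval] = sorted(intervals)
        self._max_end: List[float] = [float("-inf")] * len(self._nodes)
        self._build(0, len(self._nodes))

    def _build(self, lo: int, hi: int) -> float:
        if lo >= hi:
            return float("-inf")
        mid = (lo + hi) // 2
        best = max(self._nodes[mid][1],
                   self._build(lo, mid),
                   self._build(mid + 1, hi))
        self._max_end[mid] = best
        return best

    def overlaps(self, start: int, end: int) -> List[Interval]:
        found: List[Interval] = []
        self._query(0, len(self._nodes), start, end, found)
        return found

    def _query(self, lo: int, hi: int, start: int, end: int,
               found: List[Interval]) -> None:
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        if self._max_end[mid] < start:
            return  # every interval in this subtree ends before the query
        self._query(lo, mid, start, end, found)
        node_start, node_end, _ = self._nodes[mid]
        if node_start <= end and start <= node_end:
            found.append(self._nodes[mid])
        if node_start <= end:
            # intervals to the right start no earlier than node_start
            self._query(mid + 1, hi, start, end, found)


def range_join(rows: Iterable[Interval],
               tree: IntervalTree) -> List[Tuple[str, str]]:
    """Probe the (conceptually broadcast) tree with every row of the big table."""
    return [(label, hit[2])
            for start, end, label in rows
            for hit in tree.overlaps(start, end)]


genes = [(100, 200, "geneA"), (150, 300, "geneB"), (400, 500, "geneC")]
reads = [(120, 130, "r1"), (250, 420, "r2"), (600, 700, "r3")]
pairs = range_join(reads, IntervalTree(genes))
print(pairs)
```

Because only the small side is copied around, the large side never has to be shuffled for the join.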

Second, we implemented an event-based method of coverage calculation in a distributed manner, using accumulators & broadcast variables to reduce network shuffles.
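
For the curious, the event-based trick can be sketched in a few lines of plain Python. This is a single-machine illustration under our own simplifying assumptions (0-based, inclusive read coordinates and a hypothetical `coverage` function), not the SeQuiLa implementation. The distributed bookkeeping with accumulators and broadcast variables is left out.

```python
from itertools import accumulate
from typing import Iterable, List, Tuple


def coverage(reads: Iterable[Tuple[int, int]], length: int) -> List[int]:
    """Per-base coverage for a region of `length` positions.

    Each read adds a +1 event where it starts and a -1 event just past
    where it ends; a running sum over the events gives the depth.
    """
    events = [0] * (length + 1)  # one extra slot for end + 1
    for start, end in reads:
        events[start] += 1
        events[end + 1] -= 1
    return list(accumulate(events[:length]))


depth = coverage([(0, 3), (2, 5), (5, 5)], length=8)
print(depth)
```

The work is proportional to the number of reads plus the region length, rather than to the total number of bases covered by all reads.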

Have a look at the documentation site: docs

Ah… The range-joins part is already published in the peer-reviewed journal Bioinformatics: bioinf-seq

The coverage part is under review in GigaScience, but you can already read it on bioRxiv: biorxiv-cov

More of our work is coming. We are still working on propagating SQL access and distributed processing for genomic data.