TL;DR
This paper introduces scalable algorithms for computing the Burrows-Wheeler Transform (BWT) with Apache Spark. By distributing the index computation itself across cloud resources, they enable efficient processing of large genomic datasets.
Contribution
It presents the first algorithms that distribute the Burrows-Wheeler Transform index computation itself, not just the input data, using Big Data frameworks for large-scale sequence processing.
Findings
Algorithms successfully handle large NGS datasets
Distributed computation improves processing efficiency
Implementation in Spark demonstrates scalability
Abstract
With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of "omics" data are collected daily and need to be processed. Indexing and compressing large sequence datasets are among the most important tasks in this context. Here we propose algorithms for computing the Burrows-Wheeler transform that rely on Big Data technologies, i.e., Apache Spark and Hadoop. Our algorithms are the first to distribute the index computation and not only the input dataset, making it possible to benefit fully from the available cloud resources.
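To make the idea of distributing the index computation concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm and does not use Spark: it only illustrates one generic way the sorting work behind a BWT could be split into independent pieces, by bucketing suffixes on a short prefix and sorting each bucket separately. The function names, the `$` end marker, and the `prefix_len` parameter are all assumptions introduced for illustration.

```python
from collections import defaultdict

SENTINEL = "$"  # assumed end marker; sorts before A, C, G, T in ASCII


def bwt_reference(text: str) -> str:
    """BWT via one global sort of all suffixes (single-machine baseline)."""
    s = text + SENTINEL
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character of a suffix is the character just before it;
    # for suffix 0, s[-1] wraps around to the sentinel.
    return "".join(s[i - 1] for i in suffix_array)


def bwt_partitioned(text: str, prefix_len: int = 1) -> str:
    """Same BWT, but the sort is split into independent buckets.

    Hypothetical stand-in for a distributed job: each bucket could be
    sorted by a different worker, since no comparison crosses buckets.
    """
    s = text + SENTINEL
    # "Map" step: route every suffix start position to a bucket keyed
    # by the first prefix_len characters of that suffix.
    buckets = defaultdict(list)
    for i in range(len(s)):
        buckets[s[i:i + prefix_len]].append(i)
    # "Reduce" step: sort each bucket on its own, then concatenate the
    # buckets in key order to recover the global suffix order.
    out = []
    for key in sorted(buckets):
        for i in sorted(buckets[key], key=lambda j: s[j:]):
            out.append(s[i - 1])
    return "".join(out)


if __name__ == "__main__":
    read = "GATTACAGATTACA"
    assert bwt_partitioned(read, prefix_len=2) == bwt_reference(read)
    print(bwt_partitioned(read, prefix_len=2))
```

The sketch materializes whole suffixes as sort keys, which is quadratic in memory and only reasonable for short strings; a real large-scale implementation would need a far more careful suffix-sorting strategy. What it does show is why bucketing by prefix is safe: suffixes in different buckets are already ordered by their bucket keys, so each bucket can be sorted without seeing the others.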
