Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark

Ylenia Galluzzo; Raffaele Giancarlo; Mario Randazzo; Simona E. Rombo

arXiv:2107.03341·cs.DS·July 8, 2021

TL;DR

This paper introduces scalable algorithms for computing the Burrows Wheeler Transform using Apache Spark, enabling efficient processing of large genomic datasets by distributing index computation across cloud resources.

Contribution

It presents the first algorithms that distribute Burrows Wheeler Transform index computation, not just data, leveraging Big Data frameworks for large-scale sequence processing.

Findings

01. Algorithms successfully handle large NGS datasets
02. Distributed computation improves processing efficiency
03. Implementation in Spark demonstrates scalability

Abstract

With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of "omics" data are collected daily and need to be processed. Indexing and compressing large sequence datasets are among the most important tasks in this context. Here we propose algorithms for computing the Burrows Wheeler transform that rely on Big Data technologies, i.e., Apache Spark and Hadoop. Our algorithms are the first to distribute the index computation and not only the input dataset, making it possible to benefit fully from the available cloud resources.
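For readers unfamiliar with the transform, the following is a minimal Python sketch of what is being computed and of why the work can, in principle, be split up. It is not the authors' Spark implementation. The function names (`bwt_naive`, `bwt_bucketed`), the `$` sentinel, and the prefix-bucketing strategy are all illustrative assumptions. The first function builds the transform by sorting every suffix on one machine. The second groups suffixes by a short prefix, sorts each group independently, and concatenates the groups in key order, which is one generic way a suffix sort could be spread over several workers.

```python
from collections import defaultdict


def bwt_naive(text: str, sentinel: str = "$") -> str:
    """Reference BWT: sort all suffixes of text + sentinel in one place.

    Assumes the sentinel does not occur in text and sorts before
    every other character (true for '$' versus letters).
    """
    s = text + sentinel
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol for a suffix is the character that precedes it;
    # for the suffix starting at 0, s[-1] is the sentinel.
    return "".join(s[i - 1] for i in order)


def bwt_bucketed(text: str, prefix_len: int = 1, sentinel: str = "$") -> str:
    """Same result, but suffixes are first grouped by a short prefix.

    Each bucket can be sorted without looking at any other bucket, so
    in a distributed setting each bucket could go to a different worker
    (hypothetical setup; here the buckets are processed in a loop).
    """
    s = text + sentinel
    buckets = defaultdict(list)
    for i in range(len(s)):
        buckets[s[i:i + prefix_len]].append(i)

    out = []
    for key in sorted(buckets):  # buckets are already globally ordered
        local_order = sorted(buckets[key], key=lambda j: s[j:])
        out.extend(s[j - 1] for j in local_order)
    return "".join(out)


if __name__ == "__main__":
    print(bwt_naive("banana"))            # annb$aa
    print(bwt_bucketed("banana", 2))      # same string as above
```

Both functions materialise suffixes as string slices, which takes quadratic memory in the worst case and is only acceptable for toy inputs. The point of the sketch is the independence of the buckets, not efficiency.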

Code & Models

Repositories

MR6996/spark-bwt
