Scalable and Efficient Construction of Suffix Array with MapReduce and In-Memory Data Store System
Hsiang-Huang Wu, Chien-Min Wang, Hsuan-Chi Kuo, Wei-Chun Chung, and Jan-Ming Ho

TL;DR
This paper presents a scalable and efficient method for constructing suffix arrays using MapReduce combined with an in-memory data store, significantly improving performance on large datasets.
Contribution
It introduces a novel scheme integrating distributed key-value stores with MapReduce to enhance suffix array construction scalability and efficiency.
Findings
Outperforms TeraSort in memory efficiency and scalability.
Successfully constructs suffix arrays for nearly 6.7 TB of data on small clusters.
Maintains high scalability and performance in sequence alignment tasks.
Abstract
Suffix Array (SA) is a cardinal data structure in many pattern matching applications, including data compression, plagiarism detection and sequence alignment. However, as the volumes of data increase abruptly, the construction of SA is not amenable to the current large-scale data processing frameworks anymore due to its intrinsic proliferation of suffixes during the construction. That is, ameliorating the performance by just adding the resources to the frameworks becomes less cost-effective, even having the severe diminishing returns. At issue now is whether we can permit SA construction to be more scalable and efficient for the everlasting accretion of data by creating a radical shift in perspective. Regarding TeraSort [1] as our baseline, we first demonstrate the fragile scalability of TeraSort and investigate what causes it through the experiments on the sequence alignment of a…
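To make the "proliferation of suffixes" mentioned in the abstract concrete, here is a minimal single-machine sketch in Python. It is an illustration under stated assumptions, not the authors' MapReduce implementation: the function names are hypothetical, and a plain dict stands in for an in-memory key-value store. The naive version copies every suffix before sorting, so an input of n characters expands to n(n+1)/2 characters. The index-based version sorts only integer offsets and reads characters from the shared store on demand.

```python
from functools import cmp_to_key


def naive_suffix_array(text):
    """Sort fully materialized suffixes (every suffix is copied as a string)."""
    suffixes = [(text[i:], i) for i in range(len(text))]
    suffixes.sort()
    return [i for _, i in suffixes]


def materialized_chars(n):
    """Total characters held when all n suffixes are copied: n(n+1)/2."""
    return n * (n + 1) // 2


def index_based_suffix_array(text):
    """Sort offsets only, reading characters from a shared store on demand.

    The dict below is a hypothetical stand-in for an in-memory key-value store.
    """
    store = {"text": text}

    def compare(i, j):
        data = store["text"]
        n = len(data)
        # Walk both suffixes in lockstep until they differ or one ends.
        while i < n and j < n:
            if data[i] != data[j]:
                return -1 if data[i] < data[j] else 1
            i += 1
            j += 1
        if i == n and j == n:
            return 0
        # The suffix that ends first is a prefix of the other and sorts first.
        return -1 if i == n else 1

    return sorted(range(len(text)), key=cmp_to_key(compare))
```

For a 6-character input the naive version already holds 21 characters of suffix data, and the gap grows quadratically with input length, which is one way to see why shipping whole suffixes through a sort-based pipeline scales poorly.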
Taxonomy
Topics: Algorithms and Data Compression · DNA and Biological Computing · Genomics and Phylogenetic Studies
