Scalable Construction of Text Indexes
Timo Bingmann, Simon Gog, and Florian Kurpicz

TL;DR
This paper introduces five scalable suffix array construction algorithms leveraging the Thrill big data framework, enabling processing of extremely large datasets for applications in data compression, mining, and bioinformatics.
Contribution
The paper presents novel suffix array algorithms designed for the Thrill framework, significantly improving scalability for large-scale string processing tasks.
Findings
Processed input sizes orders of magnitude larger than those considered in prior work
Demonstrated efficiency and scalability of the algorithms in big data environments
Enhanced applicability of suffix arrays in data-intensive domains
Abstract
The suffix array is the key to efficient solutions for a myriad of string processing problems in different application domains, such as data compression, data mining, and bioinformatics. With the rapid growth of available data, suffix array construction algorithms had to be adapted to advanced computational models such as external memory and distributed computing. In this article, we present five suffix array construction algorithms utilizing the new algorithmic big data batch processing framework Thrill, which allows us to process input sizes in orders of magnitude that have not been considered before.
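For readers unfamiliar with the data structure, the following is a minimal in-memory sketch of what a suffix array is. It lists the start positions of all suffixes of a text in lexicographic order. The sketch uses plain prefix doubling in Python; it is an illustration only, is not claimed to be one of the five algorithms from the paper, and does not use Thrill. The function name `suffix_array` is a hypothetical name chosen for this example.

```python
def suffix_array(text: str) -> list[int]:
    """Return start positions of all suffixes of `text` in lexicographic order.

    Illustrative prefix-doubling sketch (hypothetical helper, not the paper's code).
    """
    n = len(text)
    if n == 0:
        return []

    # Round 0: rank each suffix by its first character only.
    rank = [ord(c) for c in text]
    sa = list(range(n))
    k = 1

    while True:
        # A suffix is compared by the rank of its first k characters,
        # then by the rank of the following k characters (-1 if past the end).
        def key(i: int) -> tuple[int, int]:
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)

        # Assign new ranks: equal keys share a rank, larger keys get the next one.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank

        # Once every suffix has a distinct rank, the order is final.
        if rank[sa[-1]] == n - 1:
            return sa
        k *= 2


if __name__ == "__main__":
    print(suffix_array("banana"))
```

Each round doubles the compared prefix length, so at most about log2(n) sorting rounds are needed. This sketch keeps everything in main memory, which is exactly the limitation that external-memory and distributed construction algorithms aim to overcome.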
Taxonomy
Topics: Algorithms and Data Compression · Network Packet Processing and Optimization · DNA and Biological Computing
