Scalable Construction of Text Indexes

Timo Bingmann, Simon Gog, and Florian Kurpicz

arXiv:1610.03007·cs.DS·October 11, 2016·1 cites

TL;DR

This paper introduces five scalable suffix array construction algorithms leveraging the Thrill big data framework, enabling processing of extremely large datasets for applications in data compression, mining, and bioinformatics.

Contribution

The paper presents novel suffix array algorithms designed for the Thrill framework, significantly improving scalability for large-scale string processing tasks.

Findings

01

Successfully processed input sizes orders of magnitude larger than previous methods

02

Demonstrated efficiency and scalability of the algorithms in big data environments

03

Enhanced applicability of suffix arrays in data-intensive domains

Abstract

The suffix array is the key to efficient solutions for a myriad of string processing problems in different application domains, such as data compression, data mining, and bioinformatics. With the rapid growth of available data, suffix array construction algorithms have had to be adapted to advanced computational models such as external memory and distributed computing. In this article, we present five suffix array construction algorithms utilizing the new algorithmic big data batch processing framework Thrill, which allows us to process input sizes in orders of magnitude that have not been considered before.
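
For readers unfamiliar with the data structure: the suffix array of a text lists the starting positions of all its suffixes in lexicographic order. The sketch below is a minimal, single-machine illustration in Python that uses prefix doubling, a standard textbook technique. It is not taken from the paper and makes no claim to match any of the five Thrill-based algorithms; the function name `suffix_array` is chosen here purely for illustration.

```python
from typing import List, Tuple


def suffix_array(text: str) -> List[int]:
    """Illustrative suffix array construction by prefix doubling.

    Assumption: this is a generic in-memory sketch, not the paper's
    distributed implementation.
    """
    n = len(text)
    if n == 0:
        return []

    # rank[i] orders suffix i by its first k characters (initially k = 1).
    rank = [ord(c) for c in text]
    sa = list(range(n))
    k = 1
    while True:
        # Compare suffixes by their first 2k characters using two k-ranks;
        # -1 marks "past the end", which sorts before any real rank.
        def key(i: int) -> Tuple[int, int]:
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)

        # Assign new ranks: equal keys share a rank, distinct keys increase it.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank

        # Once all ranks are distinct, the order is final.
        if rank[sa[-1]] == n - 1:
            return sa
        k *= 2


print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

For "banana", the sorted suffixes are "a", "ana", "anana", "banana", "na", "nana", which start at positions 5, 3, 1, 0, 4, 2. Each round sorts all n suffixes, which is the kind of bulk sort step that external-memory and distributed settings have to scale.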

Taxonomy

Topics: Algorithms and Data Compression · Network Packet Processing and Optimization · DNA and Biological Computing