Partial Data Compression and Text Indexing via Optimal Suffix Multi-Selection
Gianni Franceschini, Roberto Grossi, S. Muthukrishnan

TL;DR
This paper introduces an optimal adaptive method for partial computation in suffix-based data compression and text indexing. By solving the suffix multi-selection problem efficiently, it retrieves K consecutive entries of the Burrows-Wheeler transform or the Suffix Array in O(K log K + N) time, instead of building the entire structure in O(N log N) time.
Contribution
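It presents an optimal algorithm for suffix multi-selection and uses it to solve partial suffix computations in O(K log K + N) comparisons and time, improving on the O(N log N) cost of full construction when K = o(N) and matching the bounds known for multi-selection of atomic elements.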
Findings
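Both partial problems (a segment of K consecutive Burrows-Wheeler transform symbols, or a chunk of K consecutive Suffix Array or Suffix Tree entries) are solved in O(K log K + N) comparisons and time.
Suffix multi-selection is solved optimally in Theta(N log N - sum_j Delta_j log Delta_j + N), where Delta_j is the gap between consecutive requested ranks.
The method applies to both data compression and text indexing.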
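To get a feel for how the multi-selection bound adapts to the requested ranks, the expression can be evaluated numerically. The sketch below is only an illustration of the formula, not part of the paper's algorithm. The function name `multiselect_bound` is hypothetical, and the boundary convention (sentinel ranks 0 and N+1 around the sorted ranks, with Delta_j the difference between neighbours) is an assumption, since the summary above does not spell it out.

```python
import math


def multiselect_bound(n, ranks):
    """Evaluate N log N - sum_j Delta_j log Delta_j + N (constants ignored).

    Assumption: ranks are distinct integers in [1, n], and the gaps Delta_j
    are taken between consecutive sorted ranks padded with 0 and n + 1.
    """
    padded = [0] + sorted(ranks) + [n + 1]
    gaps = [b - a for a, b in zip(padded, padded[1:])]
    return n * math.log2(n) - sum(d * math.log2(d) for d in gaps) + n


# Requesting every rank (full suffix sorting): all gaps are 1, so the
# subtracted sum vanishes and the value is N log N + N.
full = multiselect_bound(8, range(1, 9))

# Requesting a single rank (suffix selection): two large gaps cancel most
# of the N log N term, leaving a value that grows linearly in N.
single = multiselect_bound(8, [4])
```

Under these assumptions, widely spaced ranks make the bound small, while densely packed ranks push it toward the cost of a full sort.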
Abstract
Consider an input text string T[1,N] drawn from an unbounded alphabet. We study partial computation in suffix-based problems for Data Compression and Text Indexing such as (I) retrieve any segment of K<=N consecutive symbols from the Burrows-Wheeler transform of T, and (II) retrieve any chunk of K<=N consecutive entries of the Suffix Array or the Suffix Tree. Prior literature would take O(N log N) comparisons (and time) to solve these problems by solving the total problem of building the entire Burrows-Wheeler transform or Text Index for T, and performing a post-processing to single out the wanted portion. We introduce a novel adaptive approach to partial computational problems above, and solve both the partial problems in O(K log K + N) comparisons and time, improving the best known running times of O(N log N) for K=o(N). These partial-computation problems are intimately…
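For reference, the "total" baseline described in the abstract (build everything, then slice out the wanted portion) can be sketched as follows. This is a naive illustration of problems (I) and (II), not the paper's adaptive algorithm. The function names are hypothetical, positions are 0-based, and the text is assumed to end with a unique sentinel symbol that is smaller than every other symbol.

```python
def suffix_array(text):
    """Starting positions of all suffixes of `text`, in lexicographic order.

    Naive construction: sorts the suffixes directly. Works for strings and
    for lists of mutually comparable symbols.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_chunk(text, start, k):
    """Problem (II), baseline: k consecutive suffix array entries from `start`."""
    return suffix_array(text)[start:start + k]


def bwt_segment(text, start, k):
    """Problem (I), baseline: k consecutive Burrows-Wheeler transform symbols.

    The BWT symbol for the suffix starting at position i is the symbol just
    before it, text[i - 1]; for i == 0 this wraps around to the last symbol.
    """
    return [text[i - 1] for i in sa_chunk(text, start, k)]


# Example on "banana$", where "$" is the assumed sentinel.
chunk = sa_chunk("banana$", 2, 3)       # three suffix array entries
segment = bwt_segment("banana$", 1, 3)  # three BWT symbols
```

Both helpers pay for the full sort even when k is tiny, which is the overhead the partial-computation approach is designed to avoid.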
Taxonomy
Topics: Algorithms and Data Compression · semigroups and automata theory · DNA and Biological Computing
