Adaptive encodings for small and fast compressed suffix arrays
Diego Díaz-Domínguez, Veli Mäkinen

TL;DR
This paper introduces variable-length blocking (VLB), an adaptive encoding method for compressed suffix arrays that balances space and query speed by tailoring auxiliary data to local compressibility. It answers queries faster than the r-index and sr-index while keeping space close to the sr-index.
Contribution
The paper proposes VLB, a novel encoding technique that adaptively compresses BWT regions, improving space efficiency and query performance in compressed suffix arrays.
Findings
VLB outperforms the r-index and sr-index in query time.
VLB retains space close to the sr-index.
VLB offers a better space-time tradeoff than the move data structure.
Abstract
Compressed suffix arrays (CSAs) index large repetitive collections and are key in many text applications. The r-index and its derivatives combine the run-length Burrows-Wheeler Transform (BWT) with suffix array sampling to achieve space proportional to the number of equal-symbol runs in the BWT. While effective for near-identical strings, their size grows quickly as variation increases, since the number of BWT runs is sensitive to edits. Existing approaches typically trade space for query speed, or vice versa, limiting their practicality at large scale. We introduce variable-length blocking (VLB), an encoding technique for BWT-based CSAs that adapts the amount of indexing information to local compressibility. The BWT is recursively divided into blocks of at most w runs (a parameter) and organized into a tree. Compressible regions appear near the root and store little auxiliary data,…
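To make the run-count argument concrete, here is a minimal Python sketch that builds a BWT by sorting rotations and then run-length encodes it. It is only an illustration of the quantity the abstract calls the number of equal-symbol runs, not the authors' implementation. The function names `bwt`, `run_length_encode`, and `count_runs` are hypothetical, the `$` sentinel is assumed to be absent from the text and smaller than every text symbol, and the quadratic rotation sort is only suitable for toy inputs.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, take the last column.

    Assumes the sentinel does not occur in text and sorts before every
    symbol of text (true for "$" versus ASCII letters).
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s):
    """Collapse s into a list of (symbol, run_length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def count_runs(text):
    """Number of equal-symbol runs in the BWT of text (often written r)."""
    return len(run_length_encode(bwt(text)))


if __name__ == "__main__":
    # "banana" -> BWT "annb$aa", which has 5 runs: a, nn, b, $, aa
    print(bwt("banana"), count_runs("banana"))

    # Compare a highly repetitive string with an edited copy to see how
    # the run count reacts to variation (values printed, not assumed).
    repetitive = "ACGT" * 8
    edited = repetitive[:13] + "T" + repetitive[14:]
    print(count_runs(repetitive), count_runs(edited))
```

Run-length BWT indexes store roughly one entry per run, so `count_runs` serves as a rough proxy for how index size responds as the input becomes less repetitive.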
Taxonomy
Topics: Algorithms and Data Compression · Advanced Data Compression Techniques · Network Packet Processing and Optimization
