Practical combinations of repetition-aware data structures
Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, Mathieu Raffinot

TL;DR
This paper investigates combining repetition-aware data structures, such as the run-length encoded BWT (RLBWT), Lempel-Ziv factors, and the compact directed acyclic word graph, to index highly-repetitive string collections in practice.
Contribution
It introduces practical variants that combine multiple measures of repetition, demonstrating their effectiveness and ease of implementation for indexing highly-repetitive data.
Findings
Combined data structures are space-efficient and scalable.
Variants outperform single-measure approaches on real datasets.
Practical implementations show competitive performance.
Abstract
Highly-repetitive collections of strings are increasingly being amassed by genome sequencing and genetic variation experiments, as well as by storing all versions of human-generated files, like webpages and source code. Existing indexes for locating all the exact occurrences of a pattern in a highly-repetitive string take advantage of a single measure of repetition. However, multiple, distinct measures of repetition all grow sublinearly in the length of a highly-repetitive string. In this paper we explore the practical advantages of combining data structures whose size depends on distinct measures of repetition. The main ingredient of our structures is the run-length encoded BWT (RLBWT), which takes space proportional to the number of runs in the Burrows-Wheeler transform of a string. We describe a range of practical variants that combine RLBWT with the set of boundaries of the…
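To make the RLBWT size measure concrete, the following minimal sketch computes the Burrows-Wheeler transform of a string and counts its runs of equal characters. It is an illustration only, not the authors' implementation: the function names `bwt` and `run_length_encode` are hypothetical, the `$` sentinel is an assumption, and the quadratic sorted-rotations construction is chosen for clarity rather than for use on real collections.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorted rotations.

    Assumes `sentinel` does not occur in `text` and sorts before every
    character of `text` (true for "$" against ASCII letters).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse maximal runs of equal characters into (char, length) pairs."""
    return [(ch, sum(1 for _ in group)) for ch, group in groupby(s)]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)                          # annb$aa
    print(len(run_length_encode(transformed)))  # 5 runs

    # A highly repetitive input: the run count stays small as the text grows.
    repetitive = bwt("abc" * 50)
    print(len(repetitive), len(run_length_encode(repetitive)))  # 151 4
```

In this toy case the 151-character transform of `"abc" * 50` collapses to 4 runs, which is the sense in which an RLBWT takes space proportional to the number of runs rather than to the length of the string.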
Taxonomy
Topics: Algorithms and Data Compression · Video Analysis and Summarization · Music and Audio Processing
