A Faster Grammar-Based Self-Index
Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich and Simon J. Puglisi

TL;DR
This paper introduces a more efficient grammar-based self-index for genomic data that reduces space and time complexity, enabling faster pattern searches in large compressed sequences.
Contribution
It presents a novel self-indexing method based on straight-line programs that improves space and search efficiency over previous approaches.
Findings
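The self-index takes O(r + z log log n) space, where r is the number of rules in the straight-line program for a string of length n and z is the number of phrases in the string's LZ77 parse.
Listing the occ occurrences of a pattern of length m takes O(m^2 + occ log log n) time.
For a balanced straight-line program, and accepting a small probability of building a faulty index, the O(m^2) term drops to O(m log m), giving O(m log m + occ log log n) search time.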
Abstract
To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on grammars. In this paper we show how, given a straight-line program with r rules for a string S[1..n] whose LZ77 parse consists of z phrases, we can store a self-index for S in O(r + z log log n) space such that, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m^2 + occ log log n) time. If the straight-line program is balanced and we accept a small probability of building a faulty index, then we can reduce the O(m^2) term to O(m log m). All previous self-indexes are larger or slower in the worst case.
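For readers unfamiliar with the term, a straight-line program is a context-free grammar that generates exactly one string, with each rule producing either a single character or the concatenation of two earlier rules. The sketch below only illustrates that object and how a character of S can be read without decompressing the whole string; it is not the paper's self-index and does no pattern matching. The list-based representation, the 0-based positions, and the names `expansion_lengths`, `access` and `expand` are assumptions made for this example.

```python
# A rule is either a one-character string (terminal) or a pair (left, right)
# of indices of EARLIER rules. The last rule is the start symbol.
# This representation is an illustrative assumption, not the paper's layout.

def expansion_lengths(rules):
    """Length of the string generated by each rule, computed bottom-up."""
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def access(rules, lengths, i):
    """Return character i (0-based) of the string, descending from the start rule."""
    node = len(rules) - 1
    if not 0 <= i < lengths[node]:
        raise IndexError("position out of range")
    while not isinstance(rules[node], str):
        left, right = rules[node]
        if i < lengths[left]:
            node = left            # position falls in the left child
        else:
            i -= lengths[left]     # skip the left child's expansion
            node = right
    return rules[node]


def expand(rules, lengths):
    """Decompress the whole string (for checking small examples only)."""
    return "".join(access(rules, lengths, i) for i in range(lengths[-1]))


# Six rules generating an 8-character Fibonacci-style string.
rules = ["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]
lengths = expansion_lengths(rules)
print(expand(rules, lengths))      # abaababa
print(access(rules, lengths, 4))   # b
```

The number of steps `access` takes equals the depth of the rule it descends through, which is one reason a balanced straight-line program (depth logarithmic in n) is convenient to work with.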
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Natural Language Processing Techniques
