Faster Approximate Pattern Matching in Compressed Repetitive Texts
Travis Gagie, Paweł Gawrychowski, Christopher Hoobin, and Simon J. Puglisi

TL;DR
This paper introduces a data structure for approximate pattern matching on highly repetitive, compressed texts such as genomic databases, improving on the query time of the earlier structure by Bille et al. (2011).
Contribution
It presents a simple, space-efficient data structure with improved time complexity for approximate pattern matching in compressed repetitive texts.
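To make the problem statement concrete, here is a minimal sketch of approximate pattern matching under edit distance on an uncompressed text, using the classical dynamic-programming approach (often attributed to Sellers). It is only a baseline for illustration and is not the compressed-text method of the paper; the function name `approx_match_ends` is hypothetical.

```python
def approx_match_ends(text, pattern, k):
    """Return every end position j (exclusive) such that some substring of
    text ending at j is within edit distance k of pattern.

    Baseline O(len(text) * len(pattern)) dynamic programming on plain text;
    an illustrative assumption, not the paper's data structure.
    """
    m = len(pattern)
    # prev[i] = best edit distance between pattern[:i] and a substring of
    # text ending at the previous position. Before reading any text, the
    # only substring is empty, so the distance is i deletions.
    prev = list(range(m + 1))
    ends = [0] if prev[m] <= k else []
    for j, c in enumerate(text, start=1):
        # cur[0] = 0 because a match may start at any text position.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra character in text
                         cur[i - 1] + 1)      # missing pattern character
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


print(approx_match_ends("abracadabra", "abra", 0))
print(approx_match_ends("xabdx", "abc", 1))
```

With `k = 0` the sketch reports exact occurrences only; raising `k` admits substrings that differ from the pattern by up to `k` insertions, deletions, or substitutions.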
Findings
Supports fast substring extraction in compressed texts.
Enables efficient approximate pattern matching with reduced time complexity.
Builds on straight-line programs whose size is within a logarithmic factor of the number of LZ77 phrases.
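The number of LZ77 phrases is the repetitiveness measure referred to above. As a rough illustration, the sketch below computes a greedy LZ77-style parse with a naive quadratic search. Several LZ77 variants exist; this one (longest earlier-starting match, overlaps allowed, plus one fresh character) and the name `lz77_parse` are assumptions made for the example.

```python
def lz77_parse(s):
    """Greedy LZ77-style parse of s into phrases.

    Each phrase is the longest prefix of the unparsed suffix that also
    occurs starting at an earlier position (overlap allowed), followed by
    one additional character when one is available. Naive search, for
    illustration only.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best = 0
        for start in range(i):
            length = 0
            # start + length < i + length < n, so indexing stays in range.
            while i + length < n and s[start + length] == s[i + length]:
                length += 1
            best = max(best, length)
        end = min(i + best + 1, n)
        phrases.append(s[i:end])
        i = end
    return phrases


print(lz77_parse("abababab"))
print(len(lz77_parse("a" * 64)))
```

Highly repetitive inputs produce very few phrases relative to their length, which is why bounds stated in terms of the phrase count are attractive for collections such as genomic databases.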
Abstract
Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with $r$ rules for a string $s$ of length $n$, we can build an $O(r)$-word data structure that allows us to extract any substring of length $m$ in $O(\log n + m)$ time. They also showed how, given a pattern $p$ of length $m$ and an edit distance $k \leq m$, their data structure supports finding all $\mathrm{occ}$ approximate matches to $p$ in $s$ in $O(r (\min(m k, k^4 + m) + \log n) + \mathrm{occ})$ time. Rytter (2003) and Charikar et al. (2005) showed that $r$ is always at least the number $z$ of phrases in the LZ77 parse of $s$, and gave algorithms for building…
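A straight-line program is a grammar that generates exactly one string. The sketch below shows one simple way to extract a substring from such a grammar without expanding the whole text, by storing expansion lengths and descending only into the parts that overlap the requested range. The class name `SLP` and its rule encoding are hypothetical; this naive descent costs time proportional to the grammar height plus the output length, and it does not implement the Bille et al. structure or the one proposed in the paper.

```python
class SLP:
    """Toy straight-line program (illustrative encoding, an assumption).

    rules[i] is either a one-character string (terminal rule) or a tuple
    (left, right) of indices of earlier rules. The last rule is the root.
    """

    def __init__(self, rules):
        self.rules = rules
        self.length = []  # length[i] = length of the expansion of rule i
        for rule in rules:
            if isinstance(rule, str):
                self.length.append(1)
            else:
                left, right = rule
                self.length.append(self.length[left] + self.length[right])

    def extract(self, start, m):
        """Return the substring of length m beginning at position start."""
        root = len(self.rules) - 1
        end = min(start + m, self.length[root])
        out = []
        self._collect(root, start, end, out)
        return "".join(out)

    def _collect(self, node, lo, hi, out):
        # Append the characters at positions [lo, hi) of node's expansion.
        if lo >= hi:
            return
        rule = self.rules[node]
        if isinstance(rule, str):
            out.append(rule)
            return
        left, right = rule
        mid = self.length[left]
        if lo < mid:
            self._collect(left, lo, min(hi, mid), out)
        if hi > mid:
            self._collect(right, max(lo, mid) - mid, hi - mid, out)


# Fibonacci-style grammar: six rules generate the string "abaababa".
slp = SLP(["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)])
print(slp.extract(0, 8))
print(slp.extract(2, 4))
```

Only the rules whose expansions overlap the requested range are visited, so the full string is never materialised.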
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Genomics and Phylogenetic Studies
