Computing MEMs and Relatives on Repetitive Text Collections
Gonzalo Navarro

TL;DR
This paper introduces efficient algorithms for computing Maximal Exact Matches (MEMs) on large, repetitive text collections using grammar-based compression, achieving near-optimal size and improved time complexities.
Contribution
It presents novel algorithms for MEM computation on grammar-compressed texts with improved time bounds and optimal size in terms of text repetitiveness.
Findings
Query time on a general run-length grammar: $O(m^2 \log^\epsilon n)$ for any constant $\epsilon > 0$, where $m$ is the pattern length and $n$ is the text length
Query time on a locally consistent grammar: $O(m \log m (\log m + \log^\epsilon n))$
Structure size is optimal in terms of $n$ and the text repetitiveness measure $\delta$
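The measure $\delta$ is not defined on this page. A minimal sketch, assuming the usual definition from the substring-complexity literature ($\delta = \max_k d_k / k$, where $d_k$ is the number of distinct length-$k$ substrings of the text), could look as follows. The function name `substring_complexity` is hypothetical, and the brute-force enumeration is for illustration on small strings only.

```python
def substring_complexity(text: str) -> float:
    """Brute-force delta = max over k of (distinct substrings of length k) / k.

    Assumption: this follows the standard substring-complexity definition;
    it is not taken from the paper's own code and does not scale to large inputs.
    """
    n = len(text)
    best = 0.0
    for k in range(1, n + 1):
        # Collect every distinct substring of length k.
        distinct = {text[i:i + k] for i in range(n - k + 1)}
        best = max(best, len(distinct) / k)
    return best
```

On a highly repetitive string such as `"ab" * 8` the value stays small regardless of length, which is why $\delta$ serves as a repetitiveness measure.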
Abstract
We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern $P[1..m]$ on a large repetitive text collection $T[1..n]$, which is represented as a (hopefully much smaller) run-length context-free grammar of size $g_{rl}$. We show that the problem can be solved in time $O(m^2 \log^\epsilon n)$, for any constant $\epsilon > 0$, on a data structure of size $O(g_{rl})$. Further, on a locally consistent grammar of size $O(\delta \log\frac{n}{\delta})$, the time decreases to $O(m \log m (\log m + \log^\epsilon n))$. The value $\delta$ is a function of the substring complexity of $T$ and $\Omega(\delta \log\frac{n}{\delta})$ is a tight lower bound on the compressibility of repetitive texts $T$, so our structure has optimal size in terms of $n$ and $\delta$. We extend our results to several related problems, such as finding $k$-MEMs, MUMs, rare MEMs, and applications.
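To make the problem statement concrete, here is a minimal brute-force sketch of what a MEM is. It assumes the usual definition: a substring of the pattern that occurs in the text and cannot be extended to the left or to the right within the pattern while still occurring in the text. It works on the plain text, not on a grammar, and is not the paper's algorithm. The name `naive_mems` is hypothetical.

```python
def naive_mems(pattern: str, text: str) -> list[tuple[int, int]]:
    """Return MEMs of `pattern` in `text` as (start, length) pairs, 0-based.

    Illustrative brute force on uncompressed text; the paper instead
    works on a grammar-compressed representation.
    """
    m = len(pattern)
    # longest[i] = length of the longest prefix of pattern[i:] occurring in text.
    longest = []
    for i in range(m):
        length = 0
        while i + length < m and pattern[i:i + length + 1] in text:
            length += 1
        longest.append(length)

    mems = []
    for i, length in enumerate(longest):
        if length == 0:
            continue  # pattern[i] does not occur in the text at all
        # Right-maximal by construction. Left-maximal iff the match starting
        # at i-1 does not already cover this one, i.e. longest[i-1] <= length.
        if i == 0 or longest[i - 1] <= length:
            mems.append((i, length))
    return mems
```

For example, `naive_mems("abcxbcd", "abcdbc")` reports the two matches `abc` (start 0) and `bcd` (start 4). The shorter matches `bc`, `c`, `cd`, and `d` are skipped because each is contained in a longer match that extends it to the left.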
Taxonomy
Topics: Algorithms and Data Compression · DNA and Biological Computing · Semigroups and Automata Theory
