Computing MEMs and Relatives on Repetitive Text Collections

Gonzalo Navarro

arXiv:2210.09914·cs.DS·September 6, 2023

Open Access

TL;DR

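This paper introduces algorithms for computing the Maximal Exact Matches (MEMs) of a pattern on large, repetitive text collections represented as run-length context-free grammars. On a locally consistent grammar, the structure has optimal size in terms of $n$ and $\delta$ and answers queries in near-linear time in the pattern length.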

Contribution

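It presents algorithms for MEM computation on grammar-compressed texts: a structure of size $O(g_{rl})$ for any run-length context-free grammar, and a faster variant on a locally consistent grammar whose size $O(\delta \log \frac{n}{\delta})$ is optimal in terms of $n$ and $\delta$.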

Findings

01

Time complexity on any run-length context-free grammar, with a structure of size $O(g_{rl})$: $O(m^2 \log^\epsilon n)$, for any constant $\epsilon > 0$

02

Time complexity for locally consistent grammar: $O(m \log m (\log m + \log^\epsilon n))$

03

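On the locally consistent grammar, the structure size $O(\delta \log \frac{n}{\delta})$ is optimal in terms of $n$ and $\delta$, matching the lower bound $\Omega(\delta \log \frac{n}{\delta})$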
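
For intuition about the repetitiveness measure $\delta$, the sketch below computes it by brute force on a small string. It assumes the definition commonly used in the literature, $\delta = \max_k d_k / k$, where $d_k$ is the number of distinct length-$k$ substrings of $T$; this page only says that $\delta$ is a function of the substring complexity, so treat the formula and the function name `substring_complexity_delta` as assumptions. The code is illustrative and far too slow for real collections.

```python
def substring_complexity_delta(text: str) -> float:
    """Brute-force delta = max over k of (distinct length-k substrings) / k.

    Assumed definition; not taken from this page.
    """
    n = len(text)
    best = 0.0
    for k in range(1, n + 1):
        # d_k: number of distinct substrings of length k
        distinct = len({text[i:i + k] for i in range(n - k + 1)})
        best = max(best, distinct / k)
    return best


if __name__ == "__main__":
    # A highly repetitive string keeps delta small regardless of its length.
    print(substring_complexity_delta("ab" * 50))
    # A string with many distinct symbols has a larger delta.
    print(substring_complexity_delta("abcdefghij"))
```

Under this definition, a string such as `"abab...ab"` has $\delta = 2$ no matter how long it is, which is why bounds stated in terms of $\delta$ can be much smaller than $n$ on repetitive collections.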

Abstract

We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern $P[1..m]$ on a large repetitive text collection $T[1..n]$, which is represented as a (hopefully much smaller) run-length context-free grammar of size $g_{rl}$. We show that the problem can be solved in time $O(m^2 \log^\epsilon n)$, for any constant $\epsilon > 0$, on a data structure of size $O(g_{rl})$. Further, on a locally consistent grammar of size $O(\delta \log \frac{n}{\delta})$, the time decreases to $O(m \log m (\log m + \log^\epsilon n))$. The value $\delta$ is a function of the substring complexity of $T$ and $\Omega(\delta \log \frac{n}{\delta})$ is a tight lower bound on the compressibility of repetitive texts $T$, so our structure has optimal size in terms of $n$ and $\delta$. We extend our results to several related problems, such as finding $k$-MEMs, MUMs, rare MEMs, and applications.
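
To make the problem statement concrete, here is a minimal brute-force sketch of MEM computation on uncompressed strings. It is not the paper's algorithm and uses no grammar. It assumes the usual definition: $P[i..j]$ is a MEM if it occurs in $T$ and cannot be extended to the left or right inside $P$ while still occurring in $T$. The function names are hypothetical and positions are 0-based.

```python
def longest_match_lengths(pattern: str, text: str) -> list[int]:
    """For each start i, length of the longest prefix of pattern[i:] occurring in text."""
    m = len(pattern)
    lengths = []
    for i in range(m):
        length = 0
        # Extend while the next longer candidate still occurs somewhere in text.
        while i + length < m and pattern[i:i + length + 1] in text:
            length += 1
        lengths.append(length)
    return lengths


def maximal_exact_matches(pattern: str, text: str) -> list[tuple[int, int]]:
    """Return MEMs as (start, length) pairs, with 0-based starts in pattern."""
    lengths = longest_match_lengths(pattern, text)
    mems = []
    for i, length in enumerate(lengths):
        if length == 0:
            continue  # pattern[i] does not occur in text at all
        # Right-maximal by construction. Left-maximal iff the match starting
        # at i-1 does not already cover this one, i.e. lengths[i-1] <= length.
        if i == 0 or lengths[i - 1] <= length:
            mems.append((i, length))
    return mems


if __name__ == "__main__":
    print(maximal_exact_matches("cadxabra", "abracadabra"))
```

On the pattern `cadxabra` and text `abracadabra`, the MEMs are `cad` (start 0, length 3) and `abra` (start 4, length 4). The brute-force version rescans the text for every candidate substring, which is exactly the cost that a compressed index of size $O(g_{rl})$ is meant to avoid.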

Taxonomy

Topics: Algorithms and Data Compression · DNA and Biological Computing · Semigroups and Automata Theory