A fast and simple $O(z \log n)$-space index for finding approximately longest common substrings

Nick Fagan; Jorge Hermo González; Travis Gagie

arXiv:2211.13434·cs.DS·December 6, 2022

TL;DR

This paper introduces a space-efficient index for large texts that enables approximate longest common substring searches with high probability, using only $O(z \, \log n)$ space where $z$ is the LZ77 parse size.

Contribution

It presents an index that supports approximate longest-common-substring queries in $O(z \log n)$ space, where $z$ is the number of phrases in the LZ77 parse of the text.

Findings

01

Index uses $O(z \log n)$ space.

02

Query time is $O(m \log \log z + \mathrm{polylog}(m+z))$ with high probability.

03

Returns a substring of the pattern that occurs in the text and whose length is at least a $(1 - \epsilon)$-fraction of the length of a longest common substring.

Abstract

We describe how, given a text $T[1..n]$ and a positive constant $\epsilon$, we can build a simple $O(z \log n)$-space index, where $z$ is the number of phrases in the LZ77 parse of $T$, such that later, given a pattern $P[1..m]$, in $O(m \log \log z + \mathrm{polylog}(m + z))$ time and with high probability we can find a substring of $P$ that occurs in $T$ and whose length is at least a $(1 - \epsilon)$-fraction of the length of a longest common substring of $P$ and $T$.
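
To make the quantities in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the paper's index and does not achieve its space or time bounds. It only illustrates what $z$ counts and what qualifies as an acceptable answer under the $(1 - \epsilon)$ guarantee. The function names are hypothetical, and the parse shown is one common greedy LZ77 variant (self-overlapping copies allowed, no trailing literal character); the paper may use a slightly different definition.

```python
def lz77_phrase_count(text: str) -> int:
    """Count phrases z in a naive greedy LZ77-style parse.

    Assumed variant: each phrase is the longest prefix of the remaining
    text that also starts at some earlier position (overlap allowed),
    or a single fresh character if no such prefix exists.
    """
    i, z, n = 0, 0, len(text)
    while i < n:
        longest = 0
        for j in range(i):  # try every earlier starting position
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            longest = max(longest, k)
        i += max(longest, 1)  # copy phrase, or one new character
        z += 1
    return z


def lcs_length(pattern: str, text: str) -> int:
    """Length of a longest common substring, by O(m * n) dynamic programming."""
    best = 0
    prev = [0] * (len(text) + 1)
    for a in pattern:
        cur = [0] * (len(text) + 1)
        for j, b in enumerate(text, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1  # extend the match ending here
                best = max(best, cur[j])
        prev = cur
    return best


def is_acceptable_answer(candidate: str, pattern: str, text: str, eps: float) -> bool:
    """Check the guarantee from the abstract: the candidate occurs in both
    strings and is at least a (1 - eps)-fraction as long as a longest
    common substring."""
    return (
        candidate in pattern
        and candidate in text
        and len(candidate) >= (1 - eps) * lcs_length(pattern, text)
    )
```

For example, with `pattern = "xabcdy"` and `text = "zzabcdzz"`, the longest common substring is `"abcd"`, so `"abc"` is an acceptable answer for $\epsilon = 0.25$ but not for $\epsilon = 0.1$. Highly repetitive texts have small $z$ relative to $n$, which is the setting where an $O(z \log n)$-space index is attractive.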

Taxonomy

Topics: Algorithms and Data Compression · Semigroups and Automata Theory · DNA and Biological Computing