Faster Approximate Pattern Matching in Compressed Repetitive Texts

Travis Gagie; Paweł Gawrychowski; Christopher Hoobin; Simon J. Puglisi

arXiv:1109.2930·cs.DS·November 1, 2012·19 cites

Open Access

TL;DR

This paper introduces a data structure for approximate pattern matching on highly repetitive texts stored in compressed form, such as genomic databases, and improves on the matching time of the structure of Bille et al. (2011).

Contribution

It presents a simple, space-efficient data structure with improved time complexity for approximate pattern matching in compressed repetitive texts.

Findings

01

Supports fast substring extraction in compressed texts.

02

Enables efficient approximate pattern matching with reduced time complexity.

03

Relates straight-line program size to the LZ77 parse: the number of rules $r$ is always at least the number $z$ of LZ77 phrases.
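
As an illustration of substring extraction from a straight-line program, the following is a minimal sketch, not the data structure of the paper or of Bille et al. (2011). It assumes a hypothetical representation in which each rule is either a single character or a pair of rule names, and it walks the grammar from the root, descending only into parts that overlap the requested range. Its running time depends on the height of the grammar, whereas the structures discussed in the abstract guarantee $O(\log n + m)$ extraction.

```python
def build_lengths(rules):
    """Return the expansion length of every rule (the grammar must be acyclic)."""
    lengths = {}

    def length(name):
        if name not in lengths:
            rhs = rules[name]
            if isinstance(rhs, str):          # terminal rule: one character
                lengths[name] = 1
            else:                             # binary rule: (left, right)
                lengths[name] = length(rhs[0]) + length(rhs[1])
        return lengths[name]

    for name in rules:
        length(name)
    return lengths


def extract(rules, lengths, root, start, m):
    """Return s[start:start+m] without expanding the whole string s."""
    out = []

    def walk(name, lo, hi):
        # [lo, hi) is the wanted range, relative to this rule's expansion.
        if lo >= hi:
            return
        rhs = rules[name]
        if isinstance(rhs, str):
            out.append(rhs)
            return
        left, right = rhs
        mid = lengths[left]
        walk(left, lo, min(hi, mid))              # part inside the left child
        walk(right, max(lo - mid, 0), hi - mid)   # part inside the right child

    walk(root, start, min(start + m, lengths[root]))
    return "".join(out)


# Hypothetical example grammar: each rule concatenates the two previous ones,
# so six rules generate the string "abaababa".
rules = {
    "F1": "b",
    "F2": "a",
    "F3": ("F2", "F1"),
    "F4": ("F3", "F2"),
    "F5": ("F4", "F3"),
    "F6": ("F5", "F4"),
}
lengths = build_lengths(rules)
substring = extract(rules, lengths, "F6", 2, 4)
```

Here `rules`, `build_lengths`, and `extract` are names chosen for this sketch. The point of the example is that the grammar has $r = 6$ rules but answers extraction queries on a string of length $n = 8$ without ever materializing it; on highly repetitive inputs the gap between $r$ and $n$ can be far larger.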

Abstract

Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with $r$ rules for a string $s$ of length $n$, we can build an $O(r)$-word data structure that allows us to extract any substring of length $m$ in $O(\log n + m)$ time. They also showed how, given a pattern $p$ of length $m$ and an edit distance $k \leq m$, their data structure supports finding all $\mathrm{occ}$ approximate matches to $p$ in $s$ in $O(r(\min(mk, k^{4} + m) + \log n) + \mathrm{occ})$ time. Rytter (2003) and Charikar et al. (2005) showed that $r$ is always at least the number $z$ of phrases in the LZ77 parse of $s$, and gave algorithms for building…
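
To make the approximate matching problem in the abstract concrete, here is a minimal sketch of the classical dynamic-programming approach (usually attributed to Sellers) on an uncompressed text. It is a baseline that defines what an approximate match is, not the compressed-text method of this paper or of Bille et al. (2011), and it runs in $O(nm)$ time. The function name `approximate_match_ends` is chosen for this sketch. It reports each position in the text at which some substring ends whose edit distance to the pattern is at most $k$.

```python
def approximate_match_ends(text, pattern, k):
    """Return end positions j such that some substring of text ending at
    index j (inclusive) is within edit distance k of pattern."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any substring
    # of the text ending at the current position. col[0] stays 0 because a
    # match is allowed to start anywhere in the text.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]                  # previous column, row i - 1
        for i in range(1, m + 1):
            old = col[i]                    # previous column, row i
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(
                prev_diag + cost,           # match or substitution
                old + 1,                    # extra character in the text
                col[i - 1] + 1,             # pattern character left unmatched
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


exact = approximate_match_ends("abcdefg", "cde", 0)
within_one = approximate_match_ends("abcdefg", "cde", 1)
```

With $k = 0$ only the exact occurrence ending at index 4 is reported; with $k = 1$ the neighbouring end positions 3 and 5 also qualify, since `cd` and `cdef` are each one edit away from `cde`.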

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Genomics and Phylogenetic Studies