Improved Grammar-Based Compressed Indexes

Francisco Claude; Gonzalo Navarro

arXiv:1110.4493 · cs.DS · October 21, 2011 · 2 cites

TL;DR

This paper presents a grammar-based compressed index that supports pattern searching and substring extraction directly on grammar-compressed text. Search time depends only logarithmically on the grammar size, and the index occupies space proportional to the grammar representation rather than to the text.

Contribution

It introduces the first grammar-compressed index supporting searches with time complexity logarithmic in the grammar size, improving efficiency over previous methods.

Findings

01

Finds the occ occurrences of a pattern of length m in O((m^2/ε) log (log u / log n) + occ log n) time.

02

Uses 2N log n + N log u + εn log n + o(N log n) bits of space, for any 0 < ε ≤ 1, compared with N log n bits for a basic grammar-based representation.

03

Extracts any substring of length ℓ in O(ℓ + h log (N/h)) time, where h is the height of the grammar tree.

Abstract

We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text $T[1..u]$ that is represented by a (context-free) grammar of $n$ (terminal and nonterminal) symbols and size $N$ (measured as the sum of the lengths of the right hands of the rules), a basic grammar-based representation of $T$ takes $N \lg n$ bits of space. Our representation requires $2N \lg n + N \lg u + \epsilon n \lg n + o(N \lg n)$ bits of space, for any $0 < \epsilon \leq 1$. It can find the positions of the $occ$ occurrences of a pattern of length $m$ in $T$ in $O\left((m^{2}/\epsilon) \lg\left(\frac{\lg u}{\lg n}\right) + occ \lg n\right)$ time, and extract any substring of length $\ell$ of $T$ in time $O(\ell + h \lg(N/h))$, where $h$ is the height of the grammar tree.
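To make the quantities in the abstract concrete, the following is a minimal sketch on a toy grammar. It computes the text length u, the grammar size N, and the height h, evaluates the leading terms of the stated space bound, and extracts a substring by descending the grammar tree with precomputed expansion lengths. The grammar `RULES` and all function names are hypothetical, chosen only for illustration. The extraction shown is a naive top-down walk, not the paper's data structure, so it does not achieve the O(ℓ + h lg(N/h)) bound.

```python
from math import log2

# Hypothetical toy grammar: each nonterminal maps to its right-hand side.
# Any symbol that is not a key is treated as a terminal character.
RULES = {
    "S": ["A", "B", "A"],
    "A": ["a", "b"],
    "B": ["A", "c"],
}


def grammar_size(rules):
    """N: sum of the lengths of the right-hand sides of all rules."""
    return sum(len(rhs) for rhs in rules.values())


def expansion_lengths(rules, start):
    """Length of the string generated by each nonterminal reachable from start."""
    lengths = {}

    def length(sym):
        if sym not in rules:  # terminal: expands to one character
            return 1
        if sym not in lengths:
            lengths[sym] = sum(length(s) for s in rules[sym])
        return lengths[sym]

    length(start)
    return lengths


def height(rules, sym):
    """h: height of the grammar tree rooted at sym (terminals have height 0)."""
    if sym not in rules:
        return 0
    return 1 + max(height(rules, s) for s in rules[sym])


def extract(rules, lengths, sym, start, count):
    """Return `count` characters of the expansion of sym from 0-based offset `start`.

    Naive descent: skips children that end before `start`, recurses into the rest.
    """
    if sym not in rules:
        return sym if (start == 0 and count > 0) else ""
    out = []
    for child in rules[sym]:
        if count <= 0:
            break
        child_len = lengths.get(child, 1)
        if start >= child_len:  # wanted range begins after this child
            start -= child_len
            continue
        take = min(count, child_len - start)
        out.append(extract(rules, lengths, child, start, take))
        count -= take
        start = 0
    return "".join(out)


def index_bits(N, n, u, eps):
    """Leading terms of the stated bound: 2N lg n + N lg u + eps*n lg n (o() term omitted)."""
    return 2 * N * log2(n) + N * log2(u) + eps * n * log2(n)


lengths = expansion_lengths(RULES, "S")
u = lengths["S"]                      # text length
N = grammar_size(RULES)               # grammar size
terminals = {s for rhs in RULES.values() for s in rhs if s not in RULES}
n = len(RULES) + len(terminals)       # terminal plus nonterminal symbols
h = height(RULES, "S")
```

Here the grammar generates the text `ababcab`, so `extract(RULES, lengths, "S", 2, 4)` returns `abca`. On such a tiny input the index bound exceeds the plain text size; the bound is meant for highly repetitive texts where N is much smaller than u.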

Taxonomy

Topics: Algorithms and Data Compression · DNA and Biological Computing · Cellular Automata and Applications