A Faster Grammar-Based Self-Index

Travis Gagie; Paweł Gawrychowski; Juha Kärkkäinen; Yakov Nekrich; Simon J. Puglisi

arXiv:1109.3954 · cs.DS · September 28, 2012 · 31 cites

TL;DR

This paper shows how to turn a straight-line program for a string into a grammar-based self-index that is smaller or faster in the worst case than all previous self-indexes, with genomic databases as the motivating application.

Contribution

Given a straight-line program with r rules for a string S[1..n] whose LZ77 parse has z phrases, it builds a self-index for S that lists all occurrences of a pattern P[1..m]; all previous self-indexes are larger or slower in the worst case.

Findings

01

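The self-index takes O(r + z log log n) space, where r is the number of rules in the straight-line program and z is the number of phrases in the LZ77 parse of the string.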

02

Listing all occ occurrences of a pattern of length m takes O(m^2 + occ log log n) time.

03

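If the straight-line program is balanced and a small probability of building a faulty index is accepted, the O(m^2) term drops to O(m log m), giving O(m log m + occ log log n) search time.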

Abstract

To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on grammars. In this paper we show how, given a straight-line program with $r$ rules for a string $S[1..n]$ whose LZ77 parse consists of $z$ phrases, we can store a self-index for $S$ in $O(r + z \log \log n)$ space such that, given a pattern $P[1..m]$, we can list the $\mathrm{occ}$ occurrences of $P$ in $S$ in $O(m^{2} + \mathrm{occ} \log \log n)$ time. If the straight-line program is balanced and we accept a small probability of building a faulty index, then we can reduce the $O(m^{2})$ term to $O(m \log m)$. All previous self-indexes are larger or slower in the worst case.
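
To make the quantities in the abstract concrete, here is a minimal Python sketch of a straight-line program (a grammar in which every rule is either a single character or a pair of earlier rules), together with a brute-force count of LZ77 phrases. It is an illustration only, not the index described in the paper: the names `RULES`, `ORDER`, `char_at`, and `lz77_phrase_count` are hypothetical, the example string is chosen for this sketch, and the parse shown is one common LZ77 variant (longest earlier-starting match, overlap allowed, or a single new character), which is an assumption about the exact definition used.

```python
from typing import Dict, List, Tuple, Union

# A rule is a terminal character or a pair (left rule, right rule).
Rule = Union[str, Tuple[str, str]]

# Hypothetical example: r = 6 rules generating the string "abaababa" (n = 8).
RULES: Dict[str, Rule] = {
    "X1": "b",
    "X2": "a",
    "X3": ("X2", "X1"),  # ab
    "X4": ("X3", "X2"),  # aba
    "X5": ("X4", "X3"),  # abaab
    "X6": ("X5", "X4"),  # abaababa
}
ORDER: List[str] = ["X1", "X2", "X3", "X4", "X5", "X6"]  # bottom-up order


def expansion_lengths(rules: Dict[str, Rule], order: List[str]) -> Dict[str, int]:
    """Compute the length of each rule's expansion, processing rules bottom-up."""
    length: Dict[str, int] = {}
    for name in order:
        rhs = rules[name]
        if isinstance(rhs, str):
            length[name] = 1
        else:
            length[name] = length[rhs[0]] + length[rhs[1]]
    return length


def char_at(rules: Dict[str, Rule], length: Dict[str, int], start: str, i: int) -> str:
    """Return S[i] (0-based) by walking down from the start rule, without expanding S."""
    name = start
    while True:
        rhs = rules[name]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < length[left]:
            name = left
        else:
            i -= length[left]
            name = right


def expand(rules: Dict[str, Rule], name: str) -> str:
    """Fully expand a rule; used here only to check the other functions."""
    rhs = rules[name]
    if isinstance(rhs, str):
        return rhs
    return expand(rules, rhs[0]) + expand(rules, rhs[1])


def lz77_phrase_count(s: str) -> int:
    """Count phrases of a greedy LZ77-style parse by brute force (quadratic or worse)."""
    i, z = 0, 0
    while i < len(s):
        longest = 0
        for j in range(i):  # candidate source positions strictly before i
            k = 0
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1
            longest = max(longest, k)
        i += max(longest, 1)  # a character with no earlier match forms its own phrase
        z += 1
    return z


LENGTH = expansion_lengths(RULES, ORDER)
TEXT = expand(RULES, "X6")
```

The walk in `char_at` takes time proportional to the height of the grammar, which is one way to see why a balanced straight-line program is a convenient assumption; the actual search procedure and space bounds are those stated in the abstract, not anything this sketch achieves.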

Taxonomy

Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Natural Language Processing Techniques