Universal Compressed Text Indexing

Gonzalo Navarro; Nicola Prezza

arXiv:1803.09520 · cs.DS · December 17, 2020

TL;DR

This paper introduces the first universal compressed self-index based on string attractors. It can index text compressed with any dictionary compression scheme, including macro schemes and collage systems, while supporting efficient pattern-matching queries.

Contribution

The paper develops a universal self-index built on string attractors. Because every dictionary compressor can be interpreted as an approximation algorithm for the smallest string attractor, a single index structure covers all such schemes under one framework.

Findings

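01. Locates the occ occurrences of any pattern of length m in O(m log n + occ log^ε n) time, for any constant ε > 0, working directly on the compressed text.

02. Uses O(γ log(n/γ)) words of space, where γ is the size of a string attractor of the text and n is the text length.

03. First index applicable to general macro schemes and collage systems.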

Abstract

The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of…
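To make the notion of a string attractor concrete: following Kempa and Prezza's definition, a set of text positions is a string attractor if every substring of the text has at least one occurrence that covers one of those positions. The brute-force Python checker below is only an illustrative sketch of that definition. The function names are hypothetical, positions are 0-based, and the code is not the index described in the paper.

```python
from typing import AbstractSet


def has_covering_occurrence(text: str, sub: str, positions: AbstractSet[int]) -> bool:
    """Return True if some occurrence of `sub` in `text` covers a position
    in `positions` (hypothetical helper, 0-based positions)."""
    start = text.find(sub)
    while start != -1:
        end = start + len(sub)  # occurrence occupies text[start:end]
        if any(start <= p < end for p in positions):
            return True
        # Move one step right so overlapping occurrences are also examined.
        start = text.find(sub, start + 1)
    return False


def is_string_attractor(text: str, positions: AbstractSet[int]) -> bool:
    """Brute-force check of the string attractor property.

    Enumerates every distinct non-empty substring, so it is only suitable
    for short illustrative inputs.
    """
    n = len(text)
    checked = set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = text[i:j]
            if sub in checked:
                continue
            checked.add(sub)
            if not has_covering_occurrence(text, sub, positions):
                return False
    return True
```

For instance, `is_string_attractor("abab", {0, 1})` holds, while `{0}` alone fails because no occurrence of "b" covers position 0. The set of all positions is always a (trivial) attractor; the interest lies in small sets, since the index's space depends on the attractor size.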
