At the Roots of Dictionary Compression: String Attractors

Dominik Kempa; Nicola Prezza

arXiv:1710.10964·cs.DS·December 17, 2020

TL;DR

This paper introduces the concept of string attractors as a unifying framework for dictionary compression techniques, providing new insights, complexity results, and optimal data structures for random access in compressed texts.

Contribution

It formalizes string attractors, establishes their relation to existing dictionary compressors, and develops optimal data structures for random access supporting all these schemes.

Findings

01 String attractors unify various dictionary compression methods.

02 Deciding whether a text has a small k-attractor is NP-complete for k≥3.

03 A single, universal data structure supports random access in optimal time on all of these compression schemes.

Abstract

A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this, decades of research have generated myriads of so-called dictionary compressors: algorithms able to reduce the text's size by exploiting its repetitiveness. Lempel-Ziv 77 is one of the most successful and well-known tools of this kind, followed by straight-line programs, run-length Burrows-Wheeler transform, macro schemes, collage systems, and the compact directed acyclic word graph. In this paper, we show that these techniques are different solutions to the same, elegant, combinatorial problem: to find a small set of positions capturing all text's substrings. We call such a set a string attractor. We first show reductions between dictionary compressors and string attractors. This gives the approximation ratios of dictionary…
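To make the phrase "a small set of positions capturing all text's substrings" concrete, here is a minimal brute-force sketch. It assumes the usual reading of "capturing": every distinct substring must have at least one occurrence that spans a position in the set. The function name `is_string_attractor`, the 0-based positions, and the optional `k` cap on substring length (giving a k-attractor check) are illustrative choices, not notation taken from the paper, and the running time is polynomial but far from the efficient structures the abstract refers to.

```python
def is_string_attractor(text, positions, k=None):
    """Check whether `positions` (0-based indices into `text`) capture every
    distinct substring of `text`, or only those of length <= k if k is given.

    A substring is captured when at least one of its occurrences
    text[start:start+length] contains some position p from `positions`.
    """
    n = len(text)
    pos = set(positions)
    max_len = n if k is None else min(k, n)

    for length in range(1, max_len + 1):
        seen = set()      # all distinct substrings of this length
        covered = set()   # those with an occurrence spanning a chosen position
        for start in range(n - length + 1):
            sub = text[start:start + length]
            seen.add(sub)
            if any(start <= p < start + length for p in pos):
                covered.add(sub)
        if seen != covered:
            return False
    return True


# Small illustrative inputs (chosen for this sketch, not from the paper).
print(is_string_attractor("abab", {1, 2}))       # both letters and all longer substrings are captured
print(is_string_attractor("abab", {1}))          # no occurrence of "a" spans position 1
print(is_string_attractor("abab", {0, 1}, k=1))  # only single characters need capturing
```

Since each distinct character is itself a substring, any attractor needs at least as many positions as the text has distinct characters, which is why `{1}` alone cannot work for `"abab"`.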
