Towards a Definitive Compressibility Measure for Repetitive Sequences
Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza

TL;DR
This paper introduces delta, a new measure of the compressibility of repetitive sequences. It is computable in linear time, monotonic, and never larger than previous measures such as gamma (the size of the smallest string attractor) and z (the size of the Lempel-Ziv parse).
Contribution
The paper proposes delta, a measure that is no larger than gamma, efficiently computable, and monotonic, and that captures the compressibility of repetitive sequences better than existing measures.
Findings
delta can be strictly smaller than gamma, by up to a logarithmic factor.
String families are constructed that require Omega(delta log(n/delta)) space to encode, showing that an O(delta log(n/delta)) space bound is optimal.
Run-length grammars of size O(delta log(n/delta)) can be built, whereas the smallest non-run-length grammar can be asymptotically larger.
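To make the measure concrete: the paper defines delta as the maximum, over all lengths k, of d_k(S)/k, where d_k(S) is the number of distinct length-k substrings of S. The sketch below computes it by brute force. It is a naive illustration only, not the linear-time algorithm mentioned above, and the function names are hypothetical.

```python
from fractions import Fraction


def substring_complexity(s: str) -> list[int]:
    """Return [d_1, ..., d_n], where d_k counts the distinct length-k substrings of s."""
    n = len(s)
    return [len({s[i:i + k] for i in range(n - k + 1)}) for k in range(1, n + 1)]


def delta(s: str) -> Fraction:
    """Naive delta: the maximum of d_k / k over k = 1..n, kept as an exact fraction."""
    if not s:
        return Fraction(0)
    return max(
        Fraction(d_k, k)
        for k, d_k in enumerate(substring_complexity(s), start=1)
    )
```

For "abab" the substring counts are d_1 = 2, d_2 = 2, d_3 = 2, d_4 = 1, so the maximum ratio is reached at k = 1 and delta("abab") is 2. Appending a symbol can only add substrings of each length, which is why the measure is monotonic.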
Abstract
Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel-Ziv parse are frequently used to estimate it. The size of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute and it is not monotonic upon symbol appends. Recently, a more principled measure, the size gamma of the smallest string attractor, was introduced. The measure gamma lower bounds all the previous relevant ones, yet length-n strings can be represented and efficiently indexed within space O(gamma log(n/gamma)), which also upper bounds most measures. While gamma is certainly a better…
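As an illustration of the Lempel-Ziv parse size that the abstract cites as an ad-hoc measure, here is a brute-force sketch. It assumes one common variant of the parse: each phrase is either a single fresh symbol or the longest prefix of the remaining text that also occurs starting at an earlier position, with overlap allowed. The function name is hypothetical, and this quadratic-time code is not taken from the paper.

```python
def lz_parse_size(s: str) -> int:
    """Count phrases in a greedy Lempel-Ziv parse of s (assumed variant, overlaps allowed)."""
    n = len(s)
    i = 0          # start of the phrase currently being formed
    phrases = 0
    while i < n:
        longest = 0
        # Try every earlier starting position j as the source of a copy.
        for j in range(i):
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            longest = max(longest, length)
        # If no earlier occurrence exists, emit a single literal symbol.
        i += max(longest, 1)
        phrases += 1
    return phrases
```

On "aaaa" this yields two phrases (the literal "a", then a self-overlapping copy of "aaa"), while a string of four distinct symbols yields four.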
Taxonomy
Topics: Algorithms and Data Compression · Handwritten Text Recognition Techniques · Natural Language Processing Techniques
