Entropy bounds for grammar compression

Michał Gańczorz

arXiv:1804.08547 · cs.DS · May 21, 2020

TL;DR

This paper bounds the bit-size of commonly used grammar compression encodings in terms of the k-th order entropy of the input string. The bounds explain both the practical efficiency and the limitations of methods like RePair and Greedy. The paper also introduces new entropy bounds for string parsing.

Contribution

The paper provides theoretical bounds on the size of common grammar compression encodings, which explain the practical performance and limitations of the RePair and Greedy algorithms, and it introduces new entropy bounds for string parsing.

Findings

01

RePair's standard encoding achieves a bit-size of 1.5|S|H_k(S), where H_k(S) is the k-th order entropy of the input string S.

02

Stopping RePair after a suitable iteration, rather than running it to completion, achieves a bit-size of |S|H_k(S).

03

The analysis explains why some methods outperform others in practice.

Abstract

Grammar compression represents a string as a context-free grammar. Achieving compression requires encoding such a grammar as a binary string; there are a few commonly used encodings. We bound the size of practically used encodings for several heuristic compression methods, including the RePair and Greedy algorithms: the standard encoding of RePair, which combines entropy coding and a special encoding of a grammar, achieves $1.5|S|H_k(S)$, where $H_k(S)$ is the $k$-th order entropy of $S$. We also show that by stopping after some iteration we can achieve $|S|H_k(S)$. This is particularly interesting, as it explains a phenomenon observed in practice: introducing too many nonterminals causes the bit-size to grow. We generalize our approach to other compression methods like Greedy and a wide class of irreducible grammars, as well as to other practically used bit encodings (including naive, which…
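
To make the quantities in the abstract concrete, here is a minimal Python sketch, not the paper's implementation. It computes |S|H_k(S) using the usual empirical definition (each symbol grouped by its k preceding symbols) and runs a naive, quadratic-time RePair that can be stopped after a chosen number of rules. The function names (`h0_bits`, `hk_bits`, `repair`, `expand`) and the `max_rules` parameter are hypothetical, introduced only for illustration. The paper's actual bit encodings are not reproduced here.

```python
from collections import Counter, defaultdict
from math import log2


def h0_bits(symbols):
    """Return |T| * H_0(T) in bits for a sequence T of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum(c * log2(n / c) for c in counts.values())


def hk_bits(s, k):
    """Return |S| * H_k(S): sum of |S_w| * H_0(S_w) over length-k contexts w."""
    followers = defaultdict(list)
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])
    return sum(h0_bits(f) for f in followers.values())


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair (left to right)."""
    counts = Counter()
    last = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last.get(pair) == i - 1:  # overlaps the occurrence just counted ("aaa")
            continue
        counts[pair] += 1
        last[pair] = i
    return counts


def replace_pair(seq, pair, nonterminal):
    """Replace occurrences of pair with nonterminal, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(nonterminal)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(s, max_rules=None):
    """Naive RePair. Nonterminals are ints; terminals are 1-char strings.

    Stops when no pair occurs twice, or after max_rules rules (assumed knob).
    Returns (working sequence, rules) with rules[nonterminal] = (left, right).
    """
    seq, rules = list(s), {}
    while max_rules is None or len(rules) < max_rules:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        nonterminal = len(rules)
        rules[nonterminal] = pair
        seq = replace_pair(seq, pair, nonterminal)
    return seq, rules


def expand(seq, rules):
    """Decompress: expand every nonterminal back into terminals."""
    out, stack = [], list(reversed(seq))
    while stack:
        x = stack.pop()
        if x in rules:
            left, right = rules[x]
            stack.extend((right, left))  # left is popped first
        else:
            out.append(x)
    return "".join(out)


if __name__ == "__main__":
    text = "abracadabra" * 8
    for limit in (0, 1, 2, 4, None):
        seq, rules = repair(text, max_rules=limit)
        assert expand(seq, rules) == text
        print(limit, len(seq), len(rules), round(h0_bits(seq), 1))
    print("|S|H_0 =", round(hk_bits(text, 0), 1), "|S|H_2 =", round(hk_bits(text, 2), 1))
```

Running the loop with increasing `max_rules` shows how the working string shrinks as nonterminals are added. The zeroth-order cost of the working string printed here is only a rough proxy, since it ignores the cost of storing the rules, which the encodings analyzed in the paper do account for.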
