On vocabulary size of grammar-based codes

Lukasz Debowski

arXiv:cs/0701047·cs.IT·March 11, 2020

TL;DR

This paper relates the vocabulary size of a grammar-based compression (the number of distinct nonterminal symbols) to the excess length of the corresponding universal code, and constructs universal grammar-based codes whose excess lengths can be bounded easily.

Contribution

The main contribution is a construction of universal grammar-based codes whose excess lengths can be bounded easily. It strengthens inequalities that were previously discussed in a weaker form in linguistics.

Findings

01 Universal grammar-based codes whose excess lengths can be bounded easily

02 Strengthened inequalities between vocabulary size and excess code length

03 Some insight into the redundancy of efficiently computable codes

Abstract

We discuss inequalities holding between the vocabulary size, i.e., the number of distinct nonterminal symbols in a grammar-based compression for a string, and the excess length of the respective universal code, i.e., the code-based analog of algorithmic mutual information. The aim is to strengthen inequalities which were discussed in a weaker form in linguistics but shed some light on redundancy of efficiently computable codes. The main contribution of the paper is a construction of universal grammar-based codes for which the excess lengths can be bounded easily.
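
To make the notion of vocabulary size concrete, the sketch below builds a small grammar for a string by greedy pair replacement and counts its nonterminals. This is a Re-Pair-style heuristic chosen for illustration, not the construction from the paper. The names repair_grammar and expand are hypothetical, and leaving the start rule out of the count is a convention assumed here.

```python
from collections import Counter


def repair_grammar(text):
    """Greedy pair replacement (illustrative sketch, not the paper's code).

    Returns the start rule as a list of symbols plus a dict mapping each
    introduced nonterminal to the pair of symbols it rewrites to.
    """
    seq = list(text)  # terminals are single characters
    rules = {}        # nonterminal name -> (left symbol, right symbol)
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats, so stop
        name = f"N{len(rules)}"  # multi-character, so never a terminal
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):
            # Replace non-overlapping occurrences of the pair, left to right.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Expand a symbol back into the terminal string it derives."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


if __name__ == "__main__":
    text = "abababab"
    start, rules = repair_grammar(text)
    # The grammar must reproduce the input exactly.
    assert "".join(expand(s, rules) for s in start) == text
    print("start rule:", start)
    print("rules:", rules)
    print("vocabulary size (nonterminals, start rule excluded):", len(rules))
```

A string with no repeated adjacent pair yields an empty rule set under this heuristic, so its vocabulary size in the sense assumed here is zero.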
