Local Grammar-Based Coding Revisited
Łukasz Dębowski

TL;DR
This paper advances the theoretical understanding of local grammar-based coding by establishing bounds, universality, and connections to linguistic power laws, with implications for language modeling.
Contribution
It introduces new bounds, a universal coding framework, and extends the theoretical foundation linking grammar-based codes to linguistic power laws.
Findings
Harmonic bounds simplify universality proofs.
Vocabulary size bounds relate to mutual information.
Finite vocabulary codes are proven to be universal.
Abstract
In the setting of minimal local grammar-based coding, the input string is represented as a grammar with the minimal output length defined via simple symbol-by-symbol encoding. This paper discusses four contributions to this field. First, we invoke a simple harmonic bound on ranked probabilities, which is reminiscent of Zipf's law and simplifies universality proofs for minimal local grammar-based codes. Second, we refine known bounds on the vocabulary size, showing its partial power-law equivalence with mutual information and redundancy. These bounds are relevant for linking Zipf's law with the neural scaling law for large language models. Third, we develop a framework for universal codes with fixed infinite vocabularies, recasting universal coding as matching ranked patterns that are independent of empirical data. Finally, we analyze grammar-based codes with finite vocabularies being empirical…
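As an illustration of the first contribution, here is a minimal sketch of a harmonic bound on ranked probabilities. It assumes the bound takes the elementary form p_r ≤ 1/r for probabilities sorted in non-increasing order, which follows from r · p_r ≤ p_1 + … + p_r ≤ 1; the paper's exact statement may differ. The function names and the sample string are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import log2


def ranked_probabilities(text):
    """Empirical symbol probabilities sorted in non-increasing order."""
    counts = Counter(text)
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)


def satisfies_harmonic_bound(probs, tol=1e-12):
    """Check p_r <= 1/r for every rank r = 1, 2, ... (assumed form of the bound)."""
    return all(p <= 1.0 / r + tol for r, p in enumerate(probs, start=1))


# Hypothetical sample input, not taken from the paper.
probs = ranked_probabilities("abracadabra")

for rank, p in enumerate(probs, start=1):
    # Taking -log2 of p_r <= 1/r gives -log2(p_r) >= log2(r):
    # the ideal code length of the rank-r symbol is at least log2(r) bits.
    print(f"rank={rank}  p={p:.4f}  1/rank={1 / rank:.4f}  "
          f"-log2(p)={-log2(p):.3f}  log2(rank)={log2(rank):.3f}")

print("harmonic bound holds:", satisfies_harmonic_bound(probs))
```

The inequality holds for any probability distribution once its values are sorted, so the check cannot fail on valid input. It is included only to make the relation between rank and probability concrete.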
Taxonomy
Topics: DNA and Biological Computing · Algorithms and Data Compression · Error Correcting Code Techniques
