On Hilberg's Law and Its Links with Guiraud's Law
Łukasz Dębowski

TL;DR
This paper derives Guiraud's law from Hilberg's hypothesis, using a conjecture in coding theory and a grammar-based definition of words that also applies to texts without spaces.
Contribution
It introduces a novel derivation of Guiraud's law from Hilberg's hypothesis, relying on a conjecture in coding theory and on a grammar-based definition of words that applies even to unspaced texts.
Findings
Derivation of Guiraud's law from Hilberg's law
Words can be operationally defined, approximately, as the nonterminals of the shortest context-free grammar for the text
The model assumes probabilistic long-memory effects in human narration
Abstract
Hilberg (1990) supposed that the finite-order excess entropy of a random human text is proportional to the square root of the text length. Assuming that Hilberg's hypothesis is true, we derive Guiraud's law, which states that the number of word types in a text is greater than proportional to the square root of the text length. Our derivation is based on a mathematical conjecture in coding theory and on several experiments suggesting that words can be defined approximately as the nonterminals of the shortest context-free grammar for the text. Such an operational definition of words can be applied even to texts deprived of spaces, which do not allow for Mandelbrot's "intermittent silence" explanation of Zipf's and Guiraud's laws. In contrast to Mandelbrot's model, ours assumes some probabilistic long-memory effects in human narration and might be capable of explaining Menzerath's law.
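As a rough illustration of the quantity Guiraud's law concerns, the sketch below counts word types and compares them with the square root of the text length. It is not the paper's method: it assumes an ordinary spaced text, splits words with a simple regular expression instead of a shortest-grammar definition, and measures length in word tokens. The function names `guiraud_ratio` and `type_growth` are hypothetical.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    # Assumption: words are maximal runs of word characters, case-folded.
    # The paper instead defines words via grammar nonterminals.
    return re.findall(r"\w+", text.lower())


def guiraud_ratio(text: str) -> float:
    """Return V / sqrt(N): word types divided by the square root of the token count."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    return len(set(tokens)) / math.sqrt(len(tokens))


def type_growth(text: str, step: int) -> list[tuple[int, int]]:
    """Return (N, V) pairs for prefixes of the token stream, sampled every `step` tokens."""
    tokens = tokenize(text)
    seen: set[str] = set()
    curve: list[tuple[int, int]] = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve
```

If the number of types grows at least in proportion to the square root of the length, the ratio returned by `guiraud_ratio` should not shrink toward zero as longer prefixes of the same text are examined; plotting the pairs from `type_growth` on log-log axes is one way to inspect this informally.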
