Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models
Vladimir Berman

TL;DR
This paper presents a simple non-linguistic model of text that derives Zipf's law and critical word length, providing a null hypothesis for understanding word statistics in natural language and large language models.
Contribution
It offers a unified, explicit mathematical derivation linking word length distribution, vocabulary growth, and rank-frequency law in a purely combinatorial model.
Findings
Word lengths follow a geometric distribution governed solely by the probability of the space symbol.
A critical word length k* marks the transition from word types that appear many times on average to word types that appear at most once.
Zipf's law emerges naturally from combinatorics without linguistic assumptions.
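The first two findings can be checked empirically with a small simulation. The sketch below is illustrative only and is not the paper's code: the four-letter alphabet, the equal letter probabilities, the space probability of 0.2, the sample size, and the helper names simulate_words and length_stats are all assumptions chosen for the example.

```python
import random
from collections import Counter


def simulate_words(n_symbols, letters="abcd", p_space=0.2, seed=0):
    """Draw n_symbols independent symbols and split them into words.

    Assumption for this sketch: all letters are equally likely and share
    the probability mass left over after the space symbol.
    """
    rng = random.Random(seed)
    p_letter = (1.0 - p_space) / len(letters)
    weights = [p_letter] * len(letters) + [p_space]
    symbols = rng.choices(letters + " ", weights=weights, k=n_symbols)
    # str.split() with no argument drops empty strings, so each word is a
    # maximal block of non-space symbols.
    return "".join(symbols).split()


def length_stats(words):
    """Count word tokens and distinct word types for each word length."""
    tokens = Counter(len(w) for w in words)
    types = Counter(len(w) for w in set(words))
    return tokens, types


p = 0.2
words = simulate_words(200_000, p_space=p)
tokens, types = length_stats(words)
total = len(words)

print("k  empirical  geometric  tokens_per_type")
for k in range(1, 11):
    empirical = tokens[k] / total
    geometric = p * (1 - p) ** (k - 1)
    per_type = tokens[k] / types[k] if types[k] else float("nan")
    print(f"{k:<2} {empirical:9.4f}  {geometric:9.4f}  {per_type:10.2f}")
```

Under these assumed parameters, the empirical length frequencies should track the geometric column, and the average number of tokens per distinct type should fall from a large value for short words toward 1 for long words, which is the transition that the critical length k* describes.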
Abstract
We study a deliberately simple, fully non-linguistic model of text: a sequence of independent draws from a finite alphabet of letters plus a single space symbol. A word is defined as a maximal block of non-space symbols. Within this symbol-level framework, which assumes no morphology, syntax, or semantics, we derive several structural results. First, word lengths follow a geometric distribution governed solely by the probability of the space symbol. Second, the expected number of words of a given length, and the expected number of distinct words of that length, admit closed-form expressions based on a coupon-collector argument. This yields a critical word length k* at which word types transition from appearing many times on average to appearing at most once. Third, combining the exponential growth of the number of possible strings of length k with the exponential decay of the…
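The first result stated in the abstract follows from a one-line argument, sketched here in assumed notation: p for the probability of the space symbol and L for the length of a word. These symbols are chosen for this illustration and may differ from the paper's.

```latex
% Assumed notation: p = P(\text{space}), L = length of a word (L \ge 1).
% A word has length k exactly when its first symbol (non-space by definition)
% is followed by k-1 further non-space symbols and then a space.
% The draws are independent, so
\[
  P(L = k) \;=\; (1-p)^{\,k-1}\, p , \qquad k = 1, 2, 3, \dots
\]
% This is a geometric distribution, with mean
\[
  \mathbb{E}[L] \;=\; \sum_{k \ge 1} k\,(1-p)^{\,k-1} p \;=\; \frac{1}{p}.
\]
```

Only the space probability enters the expression; the size of the alphabet and the individual letter probabilities play no role in the length distribution.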
Taxonomy
Topics: Authorship Attribution and Profiling · Language and cultural evolution · Complex Network Analysis Techniques
