The Morphemic Origin of Zipf's Law: A Factorized Combinatorial Framework
Vladimir Berman

TL;DR
This paper introduces a morphemic combinatorial model explaining word length distributions and Zipf-like frequency curves, showing these patterns emerge from morphological structure alone without needing meaning or communication optimization.
Contribution
It presents a novel probabilistic model based on positional morpheme slots that accounts for word length distributions and Zipf-like rank-frequency curves without relying on classical explanations such as random text or communication efficiency.
Findings
Word length distribution matches real language data
Zipf-like frequency curves emerge from the model
Patterns are produced without semantic or communicative factors
Abstract
We present a simple structure-based model of how words are formed from morphemes. The model explains two major empirical facts: the typical distribution of word lengths and the appearance of Zipf-like rank-frequency curves. In contrast to classical explanations based on random text or communication efficiency, our approach uses only the combinatorial organization of prefixes, roots, suffixes and inflections. In this Morphemic Combinatorial Word Model, a word is created by activating several positional slots. Each slot turns on with a certain probability and selects one morpheme from its inventory. Morphemes are treated as stable building blocks that regularly appear in word formation and have characteristic positions. This mechanism produces realistic word length patterns with a concentrated middle zone and a thin long tail, closely matching real languages. Simulations with synthetic…
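The slot mechanism described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the four slot names follow the abstract's list of prefixes, roots, suffixes and inflections, but the activation probabilities, the toy English-like inventories, the uniform choice of morpheme within a slot, and the decision to always activate the root slot (so that no word is empty) are hypothetical. The function names generate_word and simulate are likewise invented for this example.

```python
import random
from collections import Counter

# Hypothetical slot table: (slot name, activation probability, inventory).
# Probabilities and inventories are illustrative assumptions, not the paper's values.
SLOTS = [
    ("prefix", 0.3, ["un", "re", "pre", "dis"]),
    ("root", 1.0, ["act", "form", "play", "move", "read", "light"]),  # assumed always on
    ("suffix", 0.5, ["er", "ion", "ful", "ness"]),
    ("inflection", 0.4, ["s", "ed", "ing"]),
]


def generate_word(rng, slots=SLOTS):
    """Build one word by visiting the slots in positional order."""
    parts = []
    for _name, p_on, inventory in slots:
        # Each slot turns on independently with its own probability.
        if rng.random() < p_on:
            # Assumption: morphemes within a slot are chosen uniformly.
            parts.append(rng.choice(inventory))
    return "".join(parts)


def simulate(n_words, seed=0, slots=SLOTS):
    """Generate n_words words and summarize lengths and rank-frequency counts."""
    rng = random.Random(seed)
    words = [generate_word(rng, slots) for _ in range(n_words)]
    # Word length distribution, measured in characters.
    length_counts = Counter(len(w) for w in words)
    # Frequencies of distinct word forms, sorted from most to least frequent.
    rank_freq = [count for _word, count in Counter(words).most_common()]
    return words, length_counts, rank_freq


if __name__ == "__main__":
    words, length_counts, rank_freq = simulate(10_000, seed=42)
    for length in sorted(length_counts):
        print(length, length_counts[length])
    print("distinct word forms:", len(rank_freq))
```

Plotting rank_freq against rank on log-log axes, and length_counts as a histogram, is one way to inspect the two patterns the abstract discusses. With inventories this small the curves are only indicative; how closely they resemble real-language data depends on the slot probabilities and inventory sizes, which the sketch does not attempt to calibrate.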
Taxonomy
Topics: Authorship Attribution and Profiling · Language and cultural evolution · Syntax, Semantics, Linguistic Variation
