Zipf Distributions from Two-Stage Symbolic Processes: Stability Under Stochastic Lexical Filtering
Vladimir Berman

TL;DR
This paper proposes a geometric model explaining Zipf's law in language as a consequence of combinatorial and probabilistic constraints, rather than communicative efficiency, supported by simulations matching real language data.
Contribution
It introduces the Full Combinatorial Word Model (FCWM) that derives Zipf-like distributions from geometric mechanisms without linguistic assumptions.
Findings
Simulations match Zipf-like distributions in multiple languages.
Zipf's law emerges from geometric constraints, not efficiency.
The model predicts rank-frequency curves from alphabet size and blank symbol probability.
Abstract
The origin of Zipf's law in language remains unsettled and is debated across fields. This study explains Zipf-like behavior using geometric mechanisms without linguistic elements. The Full Combinatorial Word Model (FCWM) forms words from a finite alphabet, generating a geometric distribution of word lengths. Interacting exponential forces yield a power-law rank-frequency curve whose shape is determined by alphabet size and blank symbol probability. Simulations support these predictions, matching English, Russian, and mixed-genre data. The symbolic model suggests that Zipf-type laws arise from geometric constraints, not communicative efficiency.
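The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the paper's FCWM implementation. It assumes the simplest reading of the abstract: symbols are drawn independently, a blank ends the current word with probability blank_prob, and the remaining probability is split evenly over alphabet_size letters. The exponent formula is the classic random-typing estimate, not a formula taken from the paper, and the names predicted_exponent and simulate_counts are hypothetical.

```python
import math
import random
from collections import Counter


def predicted_exponent(alphabet_size, blank_prob):
    """Random-typing estimate of the rank-frequency exponent (an assumption).

    About M**k words have length <= k, so rank grows like M**k, while each
    word of length k has probability proportional to ((1 - p) / M)**k.
    Eliminating k gives frequency ~ rank**(-alpha) with
    alpha = 1 - ln(1 - p) / ln(M).
    """
    return 1.0 - math.log(1.0 - blank_prob) / math.log(alphabet_size)


def simulate_counts(alphabet_size, blank_prob, n_symbols, seed=0):
    """Draw n_symbols symbols and count the words delimited by blanks."""
    rng = random.Random(seed)
    counts = Counter()
    word = []
    for _ in range(n_symbols):
        if rng.random() < blank_prob:
            # A blank closes the current word; empty words are skipped.
            if word:
                counts[tuple(word)] += 1
                word = []
        else:
            # Letters are equiprobable in this sketch.
            word.append(rng.randrange(alphabet_size))
    return counts


counts = simulate_counts(alphabet_size=4, blank_prob=0.2, n_symbols=200_000)
ranked = sorted(counts.values(), reverse=True)  # frequency by rank, 1-based

print("predicted exponent:", round(predicted_exponent(4, 0.2), 3))
for rank in (1, 10, 100, 1000):
    if rank <= len(ranked):
        print(rank, ranked[rank - 1])
```

Under these assumptions, non-empty word lengths follow a geometric distribution with mean 1 / blank_prob. Plotting ranked on log-log axes should give a staircase that follows a power law on average, because all words of the same length share the same probability.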
Taxonomy
Topics: Authorship Attribution and Profiling · Language and cultural evolution · Linguistic Variation and Morphology
