Zipf Distributions from Two-Stage Symbolic Processes: Stability Under Stochastic Lexical Filtering

Vladimir Berman

arXiv:2511.21060 · stat.ME · November 27, 2025

Open Access

TL;DR

This paper proposes a geometric model that explains Zipf's law in language as a consequence of combinatorial and probabilistic constraints rather than communicative efficiency. Simulations of the model match real language data.

Contribution

The paper introduces the Full Combinatorial Word Model (FCWM), which derives Zipf-like distributions from geometric mechanisms without linguistic assumptions.

Findings

01 Simulations reproduce the Zipf-like distributions observed in English, Russian, and mixed-genre data.

02 Zipf's law emerges from geometric constraints, not communicative efficiency.

03 The model predicts rank-frequency curves from the alphabet size and the blank symbol probability.
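To make the dependence on alphabet size and blank symbol probability concrete, here is the textbook random-typing calculation. It is a sketch under simplifying assumptions that are not taken from the paper: the symbols N (alphabet size) and q (blank symbol probability) are illustrative names, all letters are assumed equiprobable, and normalization constants are ignored. The FCWM derivation itself may differ in detail.

```latex
% Assumed setup: N equiprobable letters, blank emitted with probability q.
% A specific word w of length n has probability
\[
  P(w) = q \left( \frac{1-q}{N} \right)^{n} .
\]
% There are N^n words of length n, so such words sit near rank r \sim N^n,
% i.e. n \approx \log_N r. Substituting gives a power law in rank:
\[
  P(r) \approx q \, r^{-\alpha},
  \qquad
  \alpha = 1 - \frac{\ln(1-q)}{\ln N} .
\]
% Worked example: N = 26, q = 0.2 gives
% \alpha = 1 + 0.2231 / 3.2581 \approx 1.07 .
```

Under these assumptions the exponent is always slightly above 1 and moves toward 1 as the alphabet grows or the blank symbol becomes rarer.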

Abstract

The origin of Zipf's law in language has no definitive explanation and remains debated across fields. This study explains Zipf-like behavior using geometric mechanisms without linguistic elements. The Full Combinatorial Word Model (FCWM) forms words from a finite alphabet, generating a geometric distribution of word lengths. Interacting exponential forces yield a power-law rank-frequency curve, determined by the alphabet size and the blank symbol probability. Simulations support these predictions, matching English, Russian, and mixed-genre data. The symbolic model suggests that Zipf-type laws arise from geometric constraints, not communicative efficiency.
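The mechanism described in the abstract can be illustrated with a minimal random-typing simulation. This is a sketch, not the author's FCWM implementation: the function names, the four-letter alphabet, the equiprobable letters, and the choice to discard empty words are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def simulate_words(alphabet, p_blank, n_symbols, seed=0):
    """Emit i.i.d. symbols; a blank closes the current word.

    Assumptions (not from the paper): letters are equiprobable and
    empty words produced by consecutive blanks are discarded.
    """
    rng = random.Random(seed)
    words, current = [], []
    for _ in range(n_symbols):
        if rng.random() < p_blank:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(rng.choice(alphabet))
    return words


def rank_frequency(words):
    """Relative frequencies sorted from most to least common word."""
    counts = Counter(words)
    total = sum(counts.values())
    return [c / total for _, c in counts.most_common()]


def loglog_slope(freqs, max_rank):
    """Least-squares slope of log(frequency) against log(rank)."""
    n = min(max_rank, len(freqs))
    xs = [math.log(r) for r in range(1, n + 1)]
    ys = [math.log(f) for f in freqs[:n]]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


words = simulate_words("abcd", p_blank=0.2, n_symbols=200_000, seed=0)
freqs = rank_frequency(words)
mean_len = sum(len(w) for w in words) / len(words)
print(f"words: {len(words)}, distinct: {len(freqs)}")
print(f"mean word length: {mean_len:.2f}")  # geometric lengths: mean near 1 / p_blank
print(f"log-log slope over top 84 ranks: {loglog_slope(freqs, 84):.2f}")
```

With a small alphabet the rank-frequency curve is a staircase, since all words of the same length are equally likely, so the fitted slope is only a rough summary of the trend. The cutoff of 84 ranks covers every word of length one to three over four letters (4 + 16 + 64).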

Taxonomy

Topics: Authorship Attribution and Profiling · Language and cultural evolution · Linguistic Variation and Morphology