The Distribution of Phoneme Frequencies across the World's Languages: Macroscopic and Microscopic Information-Theoretic Models
Fermín Moscoso del Prado Martín, Suchir Salhan

TL;DR
This paper presents a unified information-theoretic framework explaining phoneme frequency distributions across languages at both macro and micro levels, linking inventory size, entropy, and phoneme probabilities.
Contribution
It introduces a macroscopic Dirichlet distribution model and a microscopic Maximum Entropy model to explain phoneme frequency patterns across languages.
Findings
Phoneme rank-frequency distributions closely follow the order statistics of a symmetric Dirichlet distribution.
Larger phonemic inventories have lower relative entropy.
Maximum Entropy models accurately predict language-specific phoneme probabilities.
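The first two findings can be illustrated with a minimal simulation sketch. It is not the authors' code: the function names, the Monte Carlo approach, and the parameter values are assumptions made for illustration, and "relative entropy" is taken here to mean Shannon entropy divided by its maximum, log K, for an inventory of K phonemes (this summary does not define the term).

```python
import math
import random

def sample_symmetric_dirichlet(k, alpha, rng):
    """Draw one probability vector from a symmetric Dirichlet(alpha, ..., alpha).

    Uses the standard construction: normalize k independent Gamma(alpha, 1) draws.
    """
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

def expected_rank_frequencies(k, alpha, n_draws=2000, seed=0):
    """Monte Carlo estimate of the mean order statistics (rank-frequency curve).

    Each draw is sorted from most to least probable, then averaged rank by rank.
    """
    rng = random.Random(seed)
    sums = [0.0] * k
    for _ in range(n_draws):
        draw = sorted(sample_symmetric_dirichlet(k, alpha, rng), reverse=True)
        for rank, p in enumerate(draw):
            sums[rank] += p
    return [s / n_draws for s in sums]

def relative_entropy(probs):
    """Shannon entropy divided by log(k); assumed reading of 'relative entropy'."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

# Hypothetical inventory size and concentration parameter, for illustration only.
curve = expected_rank_frequencies(k=30, alpha=1.0)
print([round(p, 4) for p in curve[:5]], round(relative_entropy(curve), 4))
```

To explore the compensation effect under these assumptions, one could vary `alpha` together with `k` and compare the resulting relative entropies; how the concentration parameter actually scales with inventory size is something this summary does not state.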
Abstract
We demonstrate that the frequency distribution of phonemes across languages can be explained at both macroscopic and microscopic levels. Macroscopically, phoneme rank-frequency distributions closely follow the order statistics of a symmetric Dirichlet distribution whose single concentration parameter scales systematically with phonemic inventory size, revealing a robust compensation effect whereby larger inventories exhibit lower relative entropy. Microscopically, a Maximum Entropy model incorporating constraints from articulatory, phonotactic, and lexical structure accurately predicts language-specific phoneme probabilities. Together, these findings provide a unified information-theoretic account of phoneme frequency structure.
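As a rough illustration of the microscopic side, a Maximum Entropy model assigns each phoneme a probability proportional to the exponential of a weighted sum of its features, with weights chosen so that the model's expected feature values match observed targets. The sketch below assumes a toy inventory of four phonemes with two hypothetical binary features and made-up target values; the paper's actual articulatory, phonotactic, and lexical constraints are not given in this summary, and plain gradient ascent is used here only for simplicity.

```python
import math

def maxent_distribution(features, weights):
    """Return p_i proportional to exp(sum_j weights[j] * features[i][j])."""
    scores = [sum(w * f for w, f in zip(weights, row)) for row in features]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def fit_maxent(features, targets, lr=0.5, steps=2000):
    """Fit weights so that model feature expectations match the targets.

    The gradient of the (concave) dual objective for weight j is
    targets[j] minus the model's expected value of feature j.
    """
    weights = [0.0] * len(targets)
    for _ in range(steps):
        probs = maxent_distribution(features, weights)
        for j in range(len(weights)):
            expected = sum(p * row[j] for p, row in zip(probs, features))
            weights[j] += lr * (targets[j] - expected)
    return weights

# Hypothetical toy data: rows are phonemes, columns are two binary features.
toy_features = [[1, 0], [1, 1], [0, 1], [0, 0]]
toy_targets = [0.6, 0.3]  # assumed target expectations, not values from the paper

toy_weights = fit_maxent(toy_features, toy_targets)
toy_probs = maxent_distribution(toy_features, toy_weights)
print([round(p, 4) for p in toy_probs])
```

With no constraints (all weights zero) the model reduces to the uniform distribution; each added constraint pulls the distribution away from uniform only as far as needed to satisfy it.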
Taxonomy
Topics: Language and cultural evolution · Phonetics and Phonology Research · Animal Vocal Communication and Behavior
