Rank diversity of languages: Generic behavior in computational linguistics
Germinal Cocho, Jorge Flores, Carlos Gershenson, Carlos Pineda, Sergio Sánchez

TL;DR
This study introduces rank diversity as a new measure of how word ranks change over time in languages, revealing universal patterns and categorizing words into stable, general, and context-specific regimes across six European languages.
Contribution
It presents the concept of rank diversity, demonstrates its universal lognormal distribution, and proposes a Gaussian random walk model to explain word rank variations over time.
Findings
Rank diversity follows a universal lognormal distribution.
Head and body sizes are similar across languages and reflect the language cores that linguists identify for basic communication.
A Gaussian random walk model reproduces observed rank variations.
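As a rough illustration of the measure behind these findings, the sketch below computes a rank diversity value for each rank. It assumes a definition that this summary does not spell out: the diversity at rank k is the number of distinct words that occupy rank k across T time slices, divided by T. The function name `rank_diversity` and the toy word lists are hypothetical and are not taken from the paper's data.

```python
from collections import defaultdict


def rank_diversity(rankings):
    """Return {rank: diversity} for a list of T ranked word lists.

    Assumed definition (not stated in this summary): diversity at rank k is
    the number of distinct words seen at rank k, divided by the number of
    time slices T.
    """
    n_slices = len(rankings)
    words_at_rank = defaultdict(set)
    for ranking in rankings:
        # Rank 1 is the most frequent word of that time slice.
        for rank, word in enumerate(ranking, start=1):
            words_at_rank[rank].add(word)
    return {rank: len(words) / n_slices
            for rank, words in sorted(words_at_rank.items())}


# Toy example: four time slices, four ranks each (made-up words).
rankings = [
    ["the", "of", "and", "war"],
    ["the", "of", "and", "steam"],
    ["the", "and", "of", "radio"],
    ["the", "of", "and", "internet"],
]
diversity = rank_diversity(rankings)
for rank, value in diversity.items():
    print(rank, value)
```

In this toy input the top rank never changes hands, so its diversity is the minimum possible (1/T), while the last rank holds a different word in every slice and reaches the maximum of 1. That mirrors the low-diversity and high-diversity ends of the ranking that the summary describes.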
Abstract
Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity for books published in six European languages since 1800, and find that it follows a universal lognormal distribution. Based on the mean and standard deviation associated with the lognormal distribution, we define three different word regimes of languages: "heads" consist of words whose rank barely changes in time, "bodies" are words of general use, while "tails" consist of context-specific words whose rank varies considerably in time. The heads and bodies reflect the size of language cores identified by linguists for basic communication. We propose a Gaussian random walk model which reproduces the rank variation of words in time…
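The Gaussian random walk mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: at each step a word at rank k receives a provisional position k plus Gaussian noise whose standard deviation is proportional to k, and the words are then re-sorted to obtain the new ranking. The proportional step width, the parameter `sigma`, and the function names are assumptions introduced here.

```python
import random


def random_walk_step(ranking, sigma, rng):
    """Return a new ranking after one noisy re-ranking step.

    Assumption: the noise width grows linearly with rank (sigma * k), so
    top-ranked words move little and low-ranked words move a lot.
    """
    scored = []
    for rank, word in enumerate(ranking, start=1):
        provisional = rank + rng.gauss(0.0, sigma * rank)
        scored.append((provisional, word))
    # Re-rank words by their provisional positions.
    scored.sort(key=lambda pair: pair[0])
    return [word for _, word in scored]


def simulate(n_words, n_steps, sigma, seed=0):
    """Return the list of rankings (initial ranking plus one per step)."""
    rng = random.Random(seed)
    ranking = list(range(n_words))  # word ids; the index is the initial rank
    history = [ranking]
    for _ in range(n_steps):
        ranking = random_walk_step(ranking, sigma, rng)
        history.append(ranking)
    return history


history = simulate(n_words=1000, n_steps=50, sigma=0.05)
print(len(history), "rankings of", len(history[0]), "words")
```

Counting the distinct word ids seen at each rank across `history` gives a simulated rank diversity that can be compared with the empirical one. The value of `sigma` here is arbitrary and would need to be fitted to data.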
