Rank dynamics of word usage at multiple scales
José A. Morales, Ewan Colman, Sergio Sánchez, Fernanda Sánchez-Puig, Carlos Pineda, Gerardo Iñiguez, Germinal Cocho, Jorge Flores, Carlos Gershenson

TL;DR
This study analyzes the evolution of word usage across multiple languages using large-scale N-gram data, revealing universal patterns and the importance of linguistic structure at different scales.
Contribution
It introduces a comprehensive analysis of rank dynamics in language, demonstrating that N-gram statistics capture features beyond individual word usage and proposing a null model for linguistic structure.
Findings
Identification of universal rank dynamics properties across languages
Existence of a core set of words essential for language understanding
N-gram statistics cannot be fully explained by word statistics alone
Abstract
The recent dramatic increase in online data availability has allowed researchers to explore human culture with unprecedented detail, such as the growth and diversification of language. In particular, it provides statistical tools to explore whether word use is similar across languages, and if so, whether these generic features appear at different scales of language structure. Here we use the Google Books N-grams dataset to analyze the temporal evolution of word usage in several languages. We apply measures proposed recently to study rank dynamics, such as the diversity of N-grams in a given rank, the probability that an N-gram changes rank between successive time intervals, the rank entropy, and the rank complexity. Using different methods, results show that there are generic properties for different languages at different scales, such as a core of words necessary to minimally…
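As a rough illustration of two of the measures named in the abstract, the sketch below computes a rank diversity and a rank change probability on toy data. The abstract gives no formulas, so the definitions here are assumptions: diversity at rank k is taken as the number of distinct items that ever occupy rank k divided by the number of time slices, and change probability as the fraction of consecutive time slices in which the item at rank k differs. The function names and the toy rankings are hypothetical and not taken from the paper.

```python
def rank_diversity(rankings, k):
    """Distinct items observed at rank k, divided by the number of time slices.

    `rankings` is a list of ranked lists, one per time slice (assumed layout).
    """
    items_at_k = {ranking[k] for ranking in rankings}
    return len(items_at_k) / len(rankings)


def rank_change_probability(rankings, k):
    """Fraction of consecutive time slices in which the item at rank k changes."""
    pairs = list(zip(rankings, rankings[1:]))
    changes = sum(1 for earlier, later in pairs if earlier[k] != later[k])
    return changes / len(pairs)


# Toy data (invented for illustration): four yearly rankings of four words.
toy_rankings = [
    ["the", "of", "and", "cat"],
    ["the", "of", "and", "dog"],
    ["the", "and", "of", "cat"],
    ["the", "of", "and", "sun"],
]

for k in range(4):
    d = rank_diversity(toy_rankings, k)
    p = rank_change_probability(toy_rankings, k)
    print(f"rank {k + 1}: diversity={d:.2f}, change probability={p:.2f}")
```

In this toy example the top rank never changes while the lowest rank changes at every step, which mirrors the qualitative picture of a stable core of frequent words and more volatile usage at lower ranks.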
