Stochastic model for the vocabulary growth in natural languages
Martin Gerlach, Eduardo G. Altmann

TL;DR
This paper introduces a stochastic model for vocabulary growth in natural languages, accounting for core and noncore words, and generalizes Zipf's and Heaps' laws to better fit historical language data.
Contribution
The paper presents a two-class (core and noncore) vocabulary growth model that explains the empirical data and generalizes Zipf's and Heaps' laws to two scaling regimes, with parameters that depend on the language.
Findings
The model fits Google Ngram data well across languages.
The composition of the core vocabulary decays exponentially over time.
The model's two free parameters depend only on the language, not on the database.
Abstract
We propose a stochastic model for the number of different words in a given database which incorporates the dependence on the database size and historical changes. The main feature of our model is the existence of two different classes of words: (i) a finite number of core-words which have higher frequency and do not affect the probability of a new word to be used; and (ii) the remaining virtually infinite number of noncore-words which have lower frequency and once used reduce the probability of a new word to be used in the future. Our model relies on a careful analysis of the Google Ngram database of books published in the last centuries and its main consequence is the generalization of Zipf's and Heaps' laws to two scaling regimes. We confirm that these generalizations yield the best simple description of the data among generic descriptive models and that the two free parameters depend…
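The two-class mechanism described in the abstract can be illustrated with a minimal simulation sketch. It is not the authors' implementation. The function name, the parameters `n_core_max`, `p_new0`, and `alpha`, the rule for deciding whether a new word is a core word, and the exact update applied to the new-word probability are all assumptions made for illustration. Only the qualitative behavior comes from the abstract: there are finitely many core words that leave the new-word probability unchanged, and noncore words that lower it once used.

```python
import random


def simulate_vocabulary_growth(n_tokens, n_core_max=100, p_new0=0.5,
                               alpha=0.8, seed=0):
    """Toy two-class vocabulary growth process (illustrative assumptions only).

    Returns the vocabulary size after each token, plus the final counts of
    core and noncore words.
    """
    rng = random.Random(seed)
    p_new = p_new0          # current probability that the next token is a new word
    n_core = 0              # distinct core words seen so far (at most n_core_max)
    n_noncore = 0           # distinct noncore words seen so far (unbounded)
    vocabulary_size = []

    for _ in range(n_tokens):
        if rng.random() < p_new:
            # Assumption: a new word is a core word with probability equal
            # to the fraction of core words not yet discovered.
            p_core = (n_core_max - n_core) / n_core_max
            if rng.random() < p_core:
                # Core word: the new-word probability is left unchanged.
                n_core += 1
            else:
                # Noncore word: the new-word probability is reduced.
                # Assumption: this multiplicative factor is a hypothetical
                # choice that keeps p_new strictly between 0 and 1.
                n_noncore += 1
                p_new *= n_noncore / (n_noncore + alpha)
        vocabulary_size.append(n_core + n_noncore)

    return vocabulary_size, n_core, n_noncore
```

Calling `sizes, n_core, n_noncore = simulate_vocabulary_growth(100_000)` and plotting `sizes` against the token index on log-log axes gives a rough picture of how growth slows as noncore words accumulate. The exponents produced depend entirely on the assumed update rule and should not be compared with the paper's fitted values.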
