Scaling laws and fluctuations in the statistics of word frequencies

Martin Gerlach; Eduardo G. Altmann

arXiv:1406.4441·physics.soc-ph·November 5, 2014

TL;DR

This paper explains the scaling laws and fluctuations in word frequency statistics using stochastic models, revealing how word usage variability impacts vocabulary growth and lexical richness in large text corpora.

Contribution

It introduces a unified stochastic framework that accounts for both vocabulary scaling and fluctuation phenomena in word frequency data.

Findings

01

Vocabulary size scales sublinearly with database size (Heaps' law).

02

Fluctuations of the vocabulary size around its average follow a scaling law of their own (fluctuation scaling).

03

Word co-occurrence correlations influence vocabulary variability.

Abstract

In this paper we combine statistical analysis of large text databases and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. Besides the sublinear scaling of the vocabulary size with database size (Heaps' law), here we report a new scaling of the fluctuations around this average (fluctuation scaling analysis). We explain both scaling laws by modeling the usage of words by simple stochastic processes in which the overall distribution of word frequencies is fat-tailed (Zipf's law) and the frequency of a single word is subject to fluctuations across documents (as in topic models). In this framework, the mean and the variance of the vocabulary size can be expressed as quenched averages, implying that: i) the inhomogeneous dissemination of words causes a reduction of the average vocabulary size in comparison to the homogeneous case, and…
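The link between Zipf's law, Heaps' law and point i) can be illustrated with a minimal sketch. It is not the authors' model or code, and it rests on assumptions that are not in the abstract: word usage is Poissonian, so a word with frequency p appears at least once in a text of about M tokens with probability 1 - exp(-M p); the Zipf exponent `alpha` and the number of word types are arbitrary; and document-to-document fluctuations are represented by a gamma-distributed multiplier with mean 1, chosen only because the resulting quenched average has a closed form. All function names are hypothetical.

```python
import math


def zipf_probs(n_types, alpha=1.5):
    """Normalized Zipf frequencies p_r proportional to r**(-alpha) (assumed form)."""
    weights = [rank ** -alpha for rank in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_vocab_homogeneous(probs, m):
    """Mean vocabulary size when every word has the same frequency in all texts.

    Under Poisson usage, a word of frequency p is absent with probability exp(-m*p).
    """
    return sum(1.0 - math.exp(-m * p) for p in probs)


def expected_vocab_inhomogeneous(probs, m, shape=0.5):
    """Mean vocabulary size when each word's frequency fluctuates across texts.

    Assumption: the frequency in a given text is p * g, with g drawn from a
    gamma distribution of the given shape and mean 1. Averaging exp(-m*p*g)
    over g gives (1 + m*p/shape)**(-shape), the quenched average used here.
    Smaller shape means stronger fluctuations.
    """
    return sum(1.0 - (1.0 + m * p / shape) ** (-shape) for p in probs)


def heaps_exponent(probs, m_small, m_large):
    """Log-log slope of the homogeneous vocabulary curve between two text sizes."""
    n_small = expected_vocab_homogeneous(probs, m_small)
    n_large = expected_vocab_homogeneous(probs, m_large)
    return math.log(n_large / n_small) / math.log(m_large / m_small)


if __name__ == "__main__":
    probs = zipf_probs(100_000, alpha=1.5)
    for m in (1_000, 10_000):
        hom = expected_vocab_homogeneous(probs, m)
        inh = expected_vocab_inhomogeneous(probs, m, shape=0.5)
        print(f"M={m}: homogeneous={hom:.1f}, inhomogeneous={inh:.1f}")
    print(f"Heaps exponent estimate: {heaps_exponent(probs, 1_000, 10_000):.2f}")
```

Because 1 - exp(-x) is concave, averaging over fluctuating frequencies can only lower the expected vocabulary relative to the homogeneous case, which is the direction of the effect stated in point i). A slope below 1 in the last line corresponds to sublinear vocabulary growth.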
