Statistical laws in linguistics
Eduardo G. Altmann, Martin Gerlach

TL;DR
This paper reviews statistical laws in linguistics, emphasizing that fluctuations must be modeled in order to interpret these laws correctly and to avoid spuriously rejecting them when the observed variance is large.
Contribution
It highlights the necessity of incorporating models that account for fluctuations to properly test and interpret linguistic statistical laws.
Findings
Fluctuations around linguistic laws are larger than expected from simple assumptions.
Large fluctuations can lead to false rejection of laws if not properly modeled.
Linguistic laws impose constraints that are less strict than previously thought.
Abstract
Zipf's law is just one out of many universal laws proposed to describe statistical regularities in language. Here we review and critically discuss how these laws can be statistically interpreted, fitted, and tested (falsified). The modern availability of large databases of written text allows for tests with an unprecedented statistical accuracy and also a characterization of the fluctuations around the typical behavior. We find that fluctuations are usually much larger than expected based on simplifying statistical assumptions (e.g., independence and lack of correlations between observations). These simplifications appear also in usual statistical tests so that the large fluctuations can be erroneously interpreted as a falsification of the law. Instead, here we argue that linguistic laws are only meaningful (falsifiable) if accompanied by a model for which the fluctuations can be computed…
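As a rough illustration of the abstract's point about fluctuations, the Python sketch below (not taken from the paper) draws word counts from a Zipf distribution under the independence assumption, fits the exponent by maximum likelihood on a grid, and estimates the count fluctuations that this assumption predicts. The function names, the grid search, and all parameter values are illustrative assumptions. Counts are kept indexed by their true rank, which sidesteps the bias introduced when ranks are assigned by sorting observed counts.

```python
import numpy as np


def zipf_probs(n_types, alpha):
    """Normalized Zipf probabilities p(r) proportional to r**(-alpha), r = 1..n_types."""
    ranks = np.arange(1, n_types + 1, dtype=float)
    weights = ranks ** (-alpha)
    return weights / weights.sum()


def neg_log_likelihood(alpha, counts):
    """Negative log-likelihood of rank-indexed counts, assuming independent tokens."""
    p = zipf_probs(len(counts), alpha)
    return -np.sum(counts * np.log(p))


def fit_alpha(counts, grid=None):
    """Grid-search maximum-likelihood estimate of the exponent (illustrative only)."""
    if grid is None:
        grid = np.linspace(0.5, 2.0, 301)
    nll = [neg_log_likelihood(a, counts) for a in grid]
    return float(grid[int(np.argmin(nll))])


def iid_fluctuations(n_types, alpha, n_tokens, n_samples, seed=0):
    """Standard deviation of each rank's count across independent synthetic texts.

    This is the spread predicted by the independence assumption, i.e. the
    baseline against which observed fluctuations would be compared.
    """
    rng = np.random.default_rng(seed)
    p = zipf_probs(n_types, alpha)
    samples = rng.multinomial(n_tokens, p, size=n_samples)
    return samples.std(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    counts = rng.multinomial(100_000, zipf_probs(1000, 1.0))
    print("fitted exponent:", fit_alpha(counts))
    print("iid std of top-5 counts:", iid_fluctuations(1000, 1.0, 100_000, 500)[:5])
```

With real text, the spread of counts observed across documents could be compared against the output of `iid_fluctuations`; the abstract's claim is that the observed spread is usually much larger than this baseline, so a test built on the baseline alone would reject the law too readily.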
