Text mixing shapes the anatomy of rank-frequency distributions: A modern Zipfian mechanics for natural language

Jake Ryland Williams, James P. Bagrow, Christopher M. Danforth, and Peter Sheridan Dodds

arXiv:1409.3870 · cs.CL · May 27, 2015

TL;DR

This paper proposes that the two scaling regimes observed in the rank-frequency distributions of large corpora arise primarily from text mixing (the aggregation of texts), rather than from a separation of language into core and non-core lexica. The claim is supported by an empirical analysis of large corpora from 10 languages.

Contribution

The paper introduces the hypothesis that text mixing explains the two scaling regimes seen in Zipfian rank-frequency distributions, as an alternative to the earlier explanation based on core and non-core lexica.

Findings

01

Text mixing causes an effective decay of word introduction.

02

The decay of word introduction caused by mixing accurately predicts the location and severity of breaks in scaling.

03

Empirical evidence from large corpora in 10 languages supports the universality of the hypothesis.

Abstract

Natural languages are full of rules and exceptions. One of the most famous quantitative rules is Zipf's law, which states that the frequency of occurrence of a word is approximately inversely proportional to its rank. Though this 'law' of ranks has been found to hold across disparate texts and forms of data, analyses of increasingly large corpora over the last 15 years have revealed the existence of two scaling regimes. These regimes have thus far been explained by a hypothesis suggesting a separability of languages into core and non-core lexica. Here, we present and defend an alternative hypothesis, that the two scaling regimes result from the act of aggregating texts. We observe that text mixing leads to an effective decay of word introduction, which we show provides accurate predictions of the location and severity of breaks in scaling. Upon examining large corpora from 10 languages…
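
The two quantities the abstract relies on, the rank-frequency distribution and the rate of word introduction, can be illustrated with a small toy sketch. It is not the authors' model or code: the two example "texts", the whitespace tokenization, and the helper names `rank_frequency` and `introduction_rate` are all assumptions made for illustration. It only shows that when a second text sharing vocabulary with the first is appended, fewer of its tokens are first appearances, which is the sense in which mixing lowers the effective rate of word introduction.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def introduction_rate(tokens, start, stop):
    """Fraction of tokens in tokens[start:stop] that are first appearances
    relative to everything earlier in the sequence."""
    seen = set(tokens[:start])
    new = 0
    for tok in tokens[start:stop]:
        if tok not in seen:
            seen.add(tok)
            new += 1
    return new / (stop - start)


# Two hypothetical toy texts that share part of their vocabulary.
text_a = "the cat sat on the mat and the dog sat on the log".split()
text_b = "the dog and the cat ran to the mat by the log".split()
mixed = text_a + text_b  # "mixing" here is plain concatenation

# Under an idealized Zipf's law, rank * frequency would be roughly constant.
freqs = rank_frequency(mixed)
for rank, freq in enumerate(freqs[:5], start=1):
    print(rank, freq, rank * freq)

# Word introduction rate within each half of the mixed text.
rate_a = introduction_rate(mixed, 0, len(text_a))
rate_b = introduction_rate(mixed, len(text_a), len(mixed))
print(rate_a, rate_b)
```

On real corpora the same two functions could be applied to much longer token streams; texts this short are far too small to exhibit Zipfian scaling, so the loop over `freqs` is only a demonstration of the bookkeeping.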
