Zipf's law for word frequencies: word forms versus lemmas in long texts
Alvaro Corral, Gemma Boleda, Ramon Ferrer-i-Cancho

TL;DR
This study examines whether Zipf's law applies equally to word forms and lemmas across long texts in multiple languages, finding similar exponents but less stable low-frequency cut-offs after lemmatization.
Contribution
It provides the first detailed analysis of Zipf's law for lemmas versus word forms in long texts from four languages, showing that the exponent is largely preserved under lemmatization while the low-frequency cut-off is less stable.
Findings
Zipf's law holds for both word forms and lemmas in all four languages studied.
Exponents of Zipf's law are similar before and after lemmatization.
Low-frequency cut-offs are less stable after lemmatization.
Abstract
Zipf's law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf's law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. In order to have as homogeneous sources as possible, we analyze some of the longest literary texts ever written, comprising four different languages, with different levels of morphological complexity. In all cases Zipf's law is fulfilled, in the sense that a power-law distribution of word or lemma frequencies is valid for several orders of magnitude. We investigate the extent to which the word-lemma transformation preserves two parameters of Zipf's law: the exponent and the low-frequency cut-off. We are not able to demonstrate a strict invariance of the tail, as for a…
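To make the two parameters concrete, here is a minimal sketch of how a power-law exponent might be estimated from word (or lemma) frequencies above a chosen low-frequency cut-off. It is not the authors' fitting procedure, which this summary does not describe. It uses the standard continuous maximum-likelihood estimator as a stand-in, and the function names `word_frequencies`, `fit_exponent`, and `sample_power_law` are hypothetical.

```python
import math
import random
from collections import Counter


def word_frequencies(tokens):
    """Return the absolute frequency of each distinct token (word form or lemma)."""
    return list(Counter(tokens).values())


def fit_exponent(frequencies, cutoff):
    """Continuous maximum-likelihood estimate of the power-law exponent.

    Only frequencies >= cutoff are used; the cutoff plays the role of the
    low-frequency cut-off. Assumption: a continuous approximation, which is
    rough for small integer frequencies.
    """
    tail = [f for f in frequencies if f >= cutoff]
    log_sum = sum(math.log(f / cutoff) for f in tail)
    if log_sum == 0:
        raise ValueError("need at least one frequency above the cut-off")
    return 1.0 + len(tail) / log_sum


def sample_power_law(n, exponent, cutoff, seed=0):
    """Draw n synthetic values from a continuous power law (inverse transform)."""
    rng = random.Random(seed)
    return [
        cutoff * (1.0 - rng.random()) ** (-1.0 / (exponent - 1.0))
        for _ in range(n)
    ]


if __name__ == "__main__":
    # Sanity check on synthetic data with a known exponent of 2.
    synthetic = sample_power_law(20000, exponent=2.0, cutoff=1.0)
    print(fit_exponent(synthetic, cutoff=1.0))
```

Running the same fit once on word-form counts and once on lemma counts, each with its own cut-off, would give the kind of before-and-after comparison the abstract describes. A careful analysis would also use a discrete estimator and a goodness-of-fit test to select the cut-off.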
