Entropy and type-token ratio in gigaword corpora

Pablo Rosillo-Rodes; Maxi San Miguel; David Sanchez

arXiv:2411.10227 · cs.CL · July 16, 2025

Open Access

TL;DR

This study explores the relationship between entropy and type-token ratio across diverse large-scale linguistic datasets, revealing a functional link grounded in natural language statistical laws.

Contribution

It introduces an empirical and analytical relation between entropy and type-token ratio in large corpora, supported by data from multiple languages and genres.

Findings

01 Discovered an empirical functional relation between entropy and type-token ratio.

02 Derived an analytical expression based on Zipf's and Heaps' laws.

03 Validated the relation across multilingual and multi-genre corpora.
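
For orientation, the two statistical laws named in the findings, together with the two diversity measures, can be written in their standard textbook forms. This is only a sketch: the symbols (r, N, V, α, β, p_i) are notation assumed here, not taken from this page, and the paper's own analytical expression is not reproduced on this page, so it is not given below.

```latex
\begin{align}
  % Zipf's law: frequency of the word with frequency rank r
  f(r) &\propto r^{-\alpha} \\
  % Heaps' law: vocabulary size V (types) of a text with N tokens
  V(N) &\propto N^{\beta}, \qquad 0 < \beta < 1 \\
  % Type-token ratio, and its scaling if Heaps' law holds
  \mathrm{TTR}(N) &= \frac{V(N)}{N} \propto N^{\beta - 1} \\
  % Word entropy from relative frequencies p_i = f_i / N
  H &= -\sum_{i=1}^{V} p_i \log p_i
\end{align}
```

Under these assumed forms, the type-token ratio decreases as the text grows, since β − 1 < 0.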

Abstract

There are different ways of measuring diversity in complex systems. In particular, in language, lexical diversity is characterized in terms of the type-token ratio and the word entropy. We here investigate both diversity metrics in six massive linguistic datasets in English, Spanish, and Turkish, consisting of books, news articles, and tweets. These gigaword corpora correspond to languages with distinct morphological features and differ in registers and genres, thus constituting a varied testbed for a quantitative approach to lexical diversity. We unveil an empirical functional relation between entropy and type-token ratio of texts of a given corpus and language, which is a consequence of the statistical laws observed in natural language. Further, in the limit of large text lengths we find an analytical expression for this relation relying on both Zipf and Heaps laws that agrees with…
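
The two diversity measures named in the abstract can be illustrated with a minimal Python sketch. It assumes whitespace tokenization, the plug-in (relative-frequency) entropy estimate, and base-2 logarithms; the paper's actual tokenization, estimator, and log base are not stated in this excerpt, and the function names `type_token_ratio` and `word_entropy` are hypothetical.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


def word_entropy(tokens):
    """Plug-in Shannon entropy, in bits, of the word-frequency distribution.

    Assumption: probabilities are estimated as relative frequencies.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    n = len(tokens)
    counts = Counter(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # Toy text; real corpora need proper tokenization and normalization.
    text = "the cat sat on the mat and the dog sat too"
    tokens = text.lower().split()
    print("tokens:", len(tokens), "types:", len(set(tokens)))
    print("TTR:", type_token_ratio(tokens))
    print("entropy (bits):", word_entropy(tokens))
```

Both functions operate on a single text; relating the two quantities across many texts of a corpus, as the abstract describes, would require evaluating them text by text.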

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling