Universal and non-universal text statistics: Clustering coefficient for language identification
Diego Espitia, Hernán Larralde

TL;DR
This paper analyzes statistical properties of short texts in seven languages, confirms that Zipf's and Herdan-Heap's laws hold for them, and shows that the distribution of clustering coefficients in word co-occurrence networks can differentiate languages and distinguish natural language from random text.
Contribution
It demonstrates that the distribution of clustering coefficients in word co-occurrence networks, unlike the degree distribution, is language-specific: it can distinguish between different natural languages and between natural and random texts.
Findings
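Zipf's and Herdan-Heap's laws hold across all seven languages, even for relatively small texts.
The degree distribution of the word co-occurrence networks is universal.
The distribution of clustering coefficients varies between languages and differs between natural and random texts.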
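As a rough illustration of the two universal laws named above, the sketch below computes a rank-frequency list (the quantity Zipf's law describes) and a vocabulary growth curve (the quantity Herdan-Heap's law describes). The tokenizer and the function names `tokenize`, `zipf_rank_frequency`, and `heaps_curve` are assumptions made for this example, not the authors' procedure.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def zipf_rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent word first."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def heaps_curve(tokens):
    """Return the number of distinct words seen after each token."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


tokens = tokenize("the cat and the dog and the bird")
ranked = zipf_rank_frequency(tokens)
growth = heaps_curve(tokens)
```

On a real text, plotting `ranked` and `growth` on log-log axes is the usual way to check for approximate power-law behavior; a toy sentence like the one above is far too short to show it.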
Abstract
In this work we analyze statistical properties of 91 relatively small texts in 7 different languages (Spanish, English, French, German, Turkish, Russian, Icelandic) as well as texts with randomly inserted spaces. Despite the size (around 11260 different words), the well known universal statistical laws -- namely Zipf and Herdan-Heap's laws -- are confirmed, and are in close agreement with results obtained elsewhere. We also construct a word co-occurrence network of each text. While the degree distribution is again universal, we note that the distribution of Clustering Coefficients, which depend strongly on the local structure of networks, can be used to differentiate between languages, as well as to distinguish natural languages from random texts.
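The network construction and the clustering coefficient mentioned in the abstract can be sketched as follows. This summary does not state how the authors link words, so the example assumes an undirected, unweighted network in which two words are connected when they appear next to each other in the text; `cooccurrence_network` and `clustering_coefficients` are hypothetical names. The local clustering coefficient of a node with k neighbors is taken in its standard form: the number of links among those neighbors divided by k(k-1)/2.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_network(tokens):
    """Link each pair of adjacent words (assumed construction, undirected)."""
    adj = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:  # ignore self-loops from repeated words
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def clustering_coefficients(adj):
    """Local clustering coefficient of every node; 0.0 when degree < 2."""
    coeffs = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs[node] = 0.0
            continue
        # Count the links that exist among the neighbors of this node.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        coeffs[node] = 2.0 * links / (k * (k - 1))
    return coeffs


network = cooccurrence_network("a b c a d".split())
coeffs = clustering_coefficients(network)
degrees = {word: len(nbrs) for word, nbrs in network.items()}
```

Collecting `coeffs.values()` over a whole text gives the distribution of clustering coefficients that the abstract proposes as a language-dependent signature, while `degrees` gives the degree distribution that it reports to be universal.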
