Universal and non-universal text statistics: Clustering coefficient for language identification

Diego Espitia; Hernán Larralde

arXiv:1911.08915·physics.soc-ph·July 15, 2020

TL;DR

This paper analyzes statistical properties of 91 relatively small texts in seven languages, confirms that universal laws such as Zipf's hold, and uses the distribution of clustering coefficients in word co-occurrence networks to differentiate languages and to distinguish natural-language texts from texts with randomly inserted spaces.

Contribution

It shows that the distribution of clustering coefficients in word co-occurrence networks can distinguish between natural languages, and between natural languages and random texts, even though the degree distribution of the same networks is universal.

Findings

01

Zipf's and Herdan-Heaps' laws hold across languages.

02

The degree distribution of word co-occurrence networks is universal across languages.

03

Clustering coefficient distributions differ between languages and between natural and random texts.
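
For orientation, the quantities named in the findings are usually written in the following textbook forms. This is a sketch under stated assumptions, not the paper's own notation: the symbols $f(r)$ (frequency of the word of rank $r$), $V(N)$ (number of distinct words among the first $N$ words), the exponents $\alpha$ and $\beta$, and the local clustering coefficient $C_i$ of a node with degree $k_i$ and $e_i$ links among its neighbors are all introduced here for illustration, and the paper may use a different variant of the clustering coefficient.

```latex
% Standard textbook forms; all symbols are assumptions, not taken from the paper.
\begin{align}
  f(r) &\propto r^{-\alpha}
      && \text{(Zipf's law)} \\
  V(N) &\propto N^{\beta}, \quad 0 < \beta < 1
      && \text{(Herdan-Heaps' law)} \\
  C_i  &= \frac{2\, e_i}{k_i\,(k_i - 1)}, \quad k_i \ge 2
      && \text{(local clustering coefficient)}
\end{align}
```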

Abstract

In this work we analyze statistical properties of 91 relatively small texts in 7 different languages (Spanish, English, French, German, Turkish, Russian, Icelandic) as well as texts with randomly inserted spaces. Despite the small size of the texts (around 11260 different words), the well-known universal statistical laws, namely Zipf's and Herdan-Heaps' laws, are confirmed, and are in close agreement with results obtained elsewhere. We also construct a word co-occurrence network of each text. While the degree distribution is again universal, we note that the distribution of clustering coefficients, which depends strongly on the local structure of the networks, can be used to differentiate between languages, as well as to distinguish natural languages from random texts.
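
The network construction described in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: words are linked when they appear next to each other in the text, tokenization is a simple lowercase regular-expression split, and the clustering coefficient is the standard local one. The function names `tokenize`, `cooccurrence_network`, and `local_clustering` are hypothetical and chosen only for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def cooccurrence_network(words):
    """Build an undirected graph that links adjacent words (assumed rule)."""
    neighbors = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:  # skip self-loops caused by immediate repetitions
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)


def local_clustering(neighbors):
    """Standard local clustering coefficient per node; 0.0 when degree < 2."""
    coeffs = {}
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            coeffs[node] = 0.0
            continue
        nbr_list = list(nbrs)
        # Count the links that exist among the neighbors of this node.
        links = sum(
            1
            for i, u in enumerate(nbr_list)
            for v in nbr_list[i + 1:]
            if v in neighbors[u]
        )
        coeffs[node] = 2.0 * links / (k * (k - 1))
    return coeffs


if __name__ == "__main__":
    sample = "the cat saw the dog and the dog saw the cat"
    network = cooccurrence_network(tokenize(sample))
    for word, c in sorted(local_clustering(network).items()):
        print(f"{word}: degree={len(network[word])}, clustering={c:.2f}")
```

Collecting the coefficients of all nodes in a text gives one empirical distribution per text; comparing such distributions across texts is the kind of analysis the abstract refers to, although the comparison procedure itself is not specified there.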
