Complexity and universality in the long-range order of words
Marcelo A Montemurro, Damián H Zanette

TL;DR
This paper reviews and extends quantitative methods to analyze the balance of order and disorder in language, revealing universal statistical features and linking semantic information to linguistic structure.
Contribution
It analyses a measure of relative entropy whose value is nearly constant across linguistic families, and shows how information theory can extract semantic structures without prior knowledge of the language.
Findings
Relative entropy measure is nearly constant across linguistic families.
Information theory can identify semantic structures in language samples.
The contribution of word ordering to the statistical structure of language is close to 3.5 bits/word across several linguistic families.
Abstract
As is the case of many signals produced by complex systems, language presents a statistical structure that is balanced between order and disorder. Here we review and extend recent results from quantitative characterisations of the degree of order in linguistic sequences that give insights into two relevant aspects of language: the presence of statistical universals in word ordering, and the link between semantic information and the statistical linguistic structure. We first analyse a measure of relative entropy that assesses how much the ordering of words contributes to the overall statistical structure of language. This measure presents an almost constant value close to 3.5 bits/word across several linguistic families. Then, we show that a direct application of information theory leads to an entropy measure that can quantify and extract semantic structures from linguistic samples, even…
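The idea of measuring how much word ordering contributes to statistical structure can be illustrated by comparing an entropy estimate for a text with the same estimate for a randomly shuffled copy, which keeps word frequencies but destroys order. The following is a minimal sketch under stated assumptions, not the authors' estimator: it uses a plug-in bigram conditional entropy as a crude stand-in for a full entropy-rate estimate, and the function names (`unigram_entropy`, `bigram_conditional_entropy`, `ordering_gap`) are hypothetical. The plug-in estimate is strongly biased on small samples, so the numbers it produces are not comparable to the 3.5 bits/word figure.

```python
import math
import random
from collections import Counter


def unigram_entropy(words):
    """Shannon entropy (bits/word) of the word-frequency distribution."""
    counts = Counter(words)
    n = len(words)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def bigram_conditional_entropy(words):
    """Plug-in estimate of H(w_t | w_{t-1}) in bits/word.

    Assumption: a bigram model is only a rough proxy for long-range order.
    """
    pair_counts = Counter(zip(words, words[1:]))
    prev_counts = Counter(words[:-1])
    n = len(words) - 1
    h = 0.0
    for (prev, _), c in pair_counts.items():
        h -= (c / n) * math.log2(c / prev_counts[prev])
    return h


def ordering_gap(words, seed=0):
    """Entropy of a shuffled copy minus entropy of the original (bits/word)."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    return bigram_conditional_entropy(shuffled) - bigram_conditional_entropy(words)


# Toy input: a highly ordered, repetitive "text".
words = "the cat sat on the mat".split() * 50
gap = ordering_gap(words)
print(f"unigram entropy: {unigram_entropy(words):.3f} bits/word")
print(f"ordering gap:    {gap:.3f} bits/word")
```

A positive gap indicates that the original ordering is more predictable than a random arrangement of the same words. On real corpora a more careful entropy estimator would be needed in place of the bigram plug-in used here.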
Taxonomy
Topics: Fractal and DNA sequence analysis · Machine Learning in Bioinformatics · Authorship Attribution and Profiling
