Towards the quantification of the semantic information encoded in written language
Marcelo A. Montemurro, Damian Zanette

TL;DR
This paper applies information theory to written language and finds a characteristic scale of a few thousand words that sets the typical size of the most informative text segments. The words that contribute most to this information are the ones most closely tied to the text's main topics.
Contribution
It introduces a method to quantify semantic content in text using information theory and identifies a typical segment size that captures the most informative parts of language.
Findings
A characteristic scale of around a few thousand words for informative segments.
Words with higher information contribution are linked to main subjects.
The observed pattern is consistent with a model of word usage in which words are distributed along the text in domains.
Abstract
Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale, of roughly a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words whose contributions to the overall information are larger are the ones more closely associated with the main subjects and topics of the text. This scenario can be explained by a model of word usage that assumes that words are distributed along the text in domains of a…
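To make the idea in the abstract concrete, here is a minimal Python sketch of one way to relate word statistics to position in a text. It assumes the quantity of interest is the mutual information between word identity and segment index, measured against a shuffled copy of the same text. The abstract as shown here does not state the exact estimator, so the function names (`segment_information`, `excess_information`), the equal-size segmentation, and the shuffling baseline are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from collections import Counter


def segment_information(words, n_segments):
    """Mutual information (in bits) between word identity and segment index.

    Assumption: the text is cut into equal-size segments and any leftover
    words at the end are dropped. Returns the total and a per-word breakdown.
    """
    seg_len = len(words) // n_segments
    used = words[: seg_len * n_segments]
    total = len(used)
    word_counts = Counter(used)
    p_seg = 1.0 / n_segments  # equal-size segments, so uniform over segments
    per_word = Counter()
    for s in range(n_segments):
        seg_counts = Counter(used[s * seg_len:(s + 1) * seg_len])
        for w, c in seg_counts.items():
            p_joint = c / total
            p_word = word_counts[w] / total
            per_word[w] += p_joint * math.log2(p_joint / (p_word * p_seg))
    return sum(per_word.values()), dict(per_word)


def excess_information(words, n_segments, n_shuffles=20, seed=0):
    """Observed information minus the average over shuffled copies.

    Shuffling destroys word order but keeps word frequencies, so the
    difference isolates structure that comes from where words appear.
    """
    rng = random.Random(seed)
    observed, _ = segment_information(words, n_segments)
    shuffled = list(words)
    baseline = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        baseline += segment_information(shuffled, n_segments)[0]
    return observed - baseline / n_shuffles


# Toy text: "the" is spread evenly, while "whale" and "ship" are each
# confined to one half of the text.
toy = ["the", "whale"] * 50 + ["the", "ship"] * 50
total_bits, contributions = segment_information(toy, n_segments=2)
ranked = sorted(contributions, key=contributions.get, reverse=True)
```

In this toy text the topic-like words `whale` and `ship` carry all of the information, while the evenly spread `the` contributes none. This mirrors the finding that the most informative words are the ones tied to the main subjects. Evaluating `excess_information` over a range of `n_segments` values on a real text would be one way to look for a characteristic segment size.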
