Relating Zipf's law to textual information

Weibing Deng; Armen E. Allahverdyan

arXiv:1809.08399 · cs.CL · September 25, 2018

TL;DR

This paper investigates how Zipf's law relates to meaningful textual information by comparing the first and second halves of texts, revealing differences that distinguish meaningful texts from random word sequences.

Contribution

It demonstrates that Zipf's law applies differently to the first and second halves of a text, linking the law to textual information and uncovering new features of hierarchical organization.

Findings

1. Zipf's law applies better to the first half of a text: it holds from smaller ranks than in the second half.

2. Words following Zipf's law are more homogeneously distributed.

3. The first half of a text is lexically richer and more diverse.

Abstract

Zipf's law is the main regularity of quantitative linguistics. Despite many works devoted to the foundations of this law, it is still unclear whether it is only a statistical regularity or whether it has deeper relations with the information-carrying structures of a text. This question relates to that of distinguishing a meaningful text (written in an unknown system) from a meaningless set of symbols that mimics the statistical features of a text. Here we contribute to resolving these questions by comparing features of the first half of a text (from the beginning to the middle) to those of its second half. This comparison can uncover hidden effects, because the halves have the same values of many parameters (style, genre, author's vocabulary, etc.). In all studied texts we saw that for the first half Zipf's law applies from smaller ranks than in the second half, i.e. the law applies better to the first…
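The half-by-half comparison described in the abstract can be illustrated with a minimal sketch. It assumes a simple lowercase regex tokenizer and reports only basic statistics for each half (token count, number of distinct words, number of words occurring once, and the top of the rank-frequency list). The function names and the chosen statistics are hypothetical illustrations and are not taken from the paper, whose actual fitting procedure for the Zipfian range is not given in the text above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract word tokens (an assumed, simple scheme)."""
    return re.findall(r"[a-z']+", text.lower())


def split_halves(tokens):
    """Split a token list at its midpoint: beginning-to-middle and middle-to-end."""
    mid = len(tokens) // 2
    return tokens[:mid], tokens[mid:]


def rank_frequency(tokens):
    """Return word frequencies in descending order; index i holds rank i + 1."""
    return sorted(Counter(tokens).values(), reverse=True)


def half_summary(tokens):
    """Collect simple per-half statistics (illustrative, not the paper's measures)."""
    freqs = rank_frequency(tokens)
    return {
        "tokens": len(tokens),
        "distinct_words": len(freqs),
        "hapax_legomena": sum(1 for f in freqs if f == 1),
        "top_frequencies": freqs[:5],
    }


def compare_halves(text):
    """Summarize the first and second halves of a text separately."""
    first, second = split_halves(tokenize(text))
    return half_summary(first), half_summary(second)
```

Calling `compare_halves(text)` on a long text returns one summary dictionary per half. Comparing `distinct_words` between the two gives a rough view of relative lexical richness, and plotting `rank_frequency` for each half on log-log axes shows where the frequency decays approximately as a power of the rank.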

Taxonomy

Topics: Authorship Attribution and Profiling · Advanced Text Analysis Techniques · Natural Language Processing Techniques