The meta book and size-dependent properties of written language
Sebastian Bernhardsson, Luis Enrique Correa da Rocha, Petter Minnhagen

TL;DR
This paper investigates how the power-law index of written language varies with text length, proposing a meta book model that links word-frequency distributions to text length and challenges traditional Zipf's law assumptions.
Contribution
It introduces a systematic analysis of text-length dependence of the power-law index and proposes the meta book concept as an abstract representation of a text's word-frequency structure.
Findings
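The estimated gamma values are consistent with a monotonic decrease from 2 to 1 as text length increases.
The infinite book limit is proposed to be gamma = 1, rather than the gamma = 2 expected if Zipf's law applied universally.
A direct connection to an extended Heaps' law is explored.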
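To make the first finding concrete, here is a minimal sketch of how one might estimate gamma on prefixes of increasing length of a tokenized text. It assumes gamma is the exponent of the word-frequency distribution P(k) ~ k^(-gamma), where k is the number of times a word occurs. The estimator is a standard continuous approximation to the maximum-likelihood fit, swapped in for illustration; it is not necessarily the procedure used in the paper, and the function names are hypothetical.

```python
import math
from collections import Counter


def frequency_histogram(tokens):
    """Map k -> number of distinct words that occur exactly k times."""
    occurrences = Counter(tokens)  # word -> number of occurrences
    return Counter(occurrences.values())


def estimate_gamma(tokens, k_min=1):
    """Approximate maximum-likelihood estimate of gamma for P(k) ~ k**(-gamma).

    Assumption of this sketch: the continuous approximation
    gamma ~ 1 + n / sum(ln(k / (k_min - 0.5))), taken over the n distinct
    words with at least k_min occurrences.
    """
    ks = [k for k in Counter(tokens).values() if k >= k_min]
    if not ks:
        raise ValueError("no words with at least k_min occurrences")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in ks)
    return 1.0 + len(ks) / log_sum


def gamma_by_length(tokens, lengths):
    """Estimate gamma on prefixes of the given lengths (hypothetical helper)."""
    return {m: estimate_gamma(tokens[:m]) for m in lengths if 0 < m <= len(tokens)}
```

Calling `gamma_by_length(tokens, [1_000, 10_000, 100_000])` on a long tokenized book would give one estimate per prefix length, which can then be compared against the trend described above. The estimate is sensitive to the choice of `k_min` and to tokenization, so it should be read as a rough indicator only.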
Abstract
Evidence is given for a systematic text-length dependence of the power-law index gamma of a single book. The estimated gamma values are consistent with a monotonic decrease from 2 to 1 with increasing length of a text. A direct connection to an extended Heaps' law is explored. The infinite book limit is, as a consequence, proposed to be given by gamma = 1 instead of the value gamma = 2 expected if Zipf's law were ubiquitously applicable. In addition, we explore the idea that the systematic text-length dependence can be described by a meta book concept, which is an abstract representation reflecting the word-frequency structure of a text. According to this concept the word-frequency distribution of a text, with a certain length written by a single author, has the same characteristics as a text of the same length pulled out from an imaginary complete infinite corpus written by the same…
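The idea of pulling a text of a given length out of a larger corpus can be illustrated with a small sketch. It assumes a finite list of tokens (for example, an author's collected works) as a stand-in for the imaginary infinite corpus, and it tracks only the number of distinct words in each section, the quantity that Heaps' law describes. All function names are hypothetical, and nothing here reproduces the paper's own analysis.

```python
import random


def pull_section(corpus_tokens, length, rng):
    """Pull a contiguous section of `length` tokens from a random start position."""
    if length > len(corpus_tokens):
        raise ValueError("section is longer than the corpus")
    start = rng.randrange(len(corpus_tokens) - length + 1)
    return corpus_tokens[start:start + length]


def distinct_words(tokens):
    """Number of distinct words in a token list."""
    return len(set(tokens))


def mean_distinct_words(corpus_tokens, length, samples=100, seed=0):
    """Average number of distinct words over randomly pulled sections."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    total = sum(
        distinct_words(pull_section(corpus_tokens, length, rng))
        for _ in range(samples)
    )
    return total / samples
```

Comparing `mean_distinct_words(corpus_tokens, len(book_tokens))` with `distinct_words(book_tokens)` for a single book by the same author gives a simple check of whether the book resembles a section of the same length pulled from the larger corpus.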
