A Paradoxical Property of the Monkey Book
Sebastian Bernhardsson, Seung Ki Baek, Petter Minnhagen

TL;DR
This paper investigates the statistical properties of "monkey books" (random sequences of letters and blanks) and finds that, unlike sampled and real books, they obey Heaps' law to an extraordinarily good approximation, precisely because their word-frequency distribution is not a smooth power law.
Contribution
It demonstrates that monkey books, despite their randomness, follow Heaps' law precisely, challenging assumptions about the relationship between word-frequency distributions and vocabulary growth.
Findings
Monkey books obey Heaps' law accurately.
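The word distribution of a monkey book differs from that of sampled and real books.
This contradicts the expectation that a power-law word-frequency distribution implies power-law vocabulary growth: the monkey book obeys Heaps' law precisely because its word-frequency distribution is not a smooth power law.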
Abstract
A "monkey book" is a book consisting of a random distribution of letters and blanks, where a group of letters surrounded by two blanks is defined as a word. We compare the statistics of the word distribution for a monkey book with the corresponding distribution for the general class of random books, where the latter are books for which the words are randomly distributed. It is shown that the word distribution statistics for the monkey book are different and quite distinct from those of a typical sampled book or real book. In particular, the monkey book obeys Heaps' power law to an extraordinarily good approximation, in contrast to the word distributions for sampled and real books, which deviate from Heaps' law in a characteristic way. The somewhat counter-intuitive conclusion is that a "monkey book" obeys Heaps' power law precisely because its word-frequency distribution is not a smooth power law,…
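The construction in the abstract can be tried out numerically. The following is a minimal sketch, not the authors' code: it types a random string of letters and blanks, splits it into words, and tracks the number of distinct words N among the first M words, which Heaps' law describes as N ∝ M^α. The alphabet size, the blank probability `p_blank`, the fitting cutoff `m_min`, and all function names are illustrative assumptions.

```python
import math
import random


def monkey_book(n_chars, alphabet="abcd", p_blank=0.2, seed=0):
    """Type n_chars random characters and return the resulting words.

    Assumption: each character is a blank with probability p_blank,
    otherwise a letter drawn uniformly from `alphabet`. Runs of several
    blanks are treated as a single separator (no empty words).
    """
    rng = random.Random(seed)
    chars = [
        " " if rng.random() < p_blank else rng.choice(alphabet)
        for _ in range(n_chars)
    ]
    return "".join(chars).split()


def vocabulary_growth(words):
    """Return N(M), the number of distinct words among the first M words,
    for M = 1 .. len(words)."""
    seen = set()
    growth = []
    for word in words:
        seen.add(word)
        growth.append(len(seen))
    return growth


def loglog_slope(growth, m_min=100):
    """Least-squares slope of log N against log M for M >= m_min.

    This is a crude estimate of the Heaps exponent: every M is weighted
    equally, so large M dominates the fit.
    """
    xs = [math.log(m) for m in range(m_min, len(growth) + 1)]
    ys = [math.log(growth[m - 1]) for m in range(m_min, len(growth) + 1)]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    words = monkey_book(200_000)
    growth = vocabulary_growth(words)
    print("words:", len(words), "distinct:", growth[-1])
    print("fitted log-log slope:", loglog_slope(growth))
```

Plotting `growth` against M on log-log axes, and repeating the measurement on a real text, is one way to look for the deviation from a straight line that the abstract attributes to sampled and real books.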
