Is space a word, too?
Jake Ryland Williams, Giovanni C. Santia

TL;DR
This paper explores the connection between Zipf's law, Mandelbrot's refinement, and Simon's model by incorporating the role of space and punctuation, suggesting a universal process underlying word and non-word object distributions.
Contribution
It demonstrates how space and punctuation, as 'dark matter', unify the Zipf-Mandelbrot law with Simon's mechanistic model, revealing a broader universal process in rank-frequency distributions.
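For reference, the two rank-frequency forms named above are usually written as follows. The symbols are notation assumptions made here, not taken from the paper: f(r) is the frequency of the item at rank r, \theta is the scaling exponent, and \beta is Mandelbrot's additive shift.

```latex
\begin{align}
  f(r) &\propto r^{-\theta}            && \text{(Zipf's law)} \\
  f(r) &\propto (r + \beta)^{-\theta}  && \text{(Zipf-Mandelbrot law)}
\end{align}
```

Setting \beta = 0 in the second form recovers the first, so \beta measures how far the head of a distribution departs from a pure power law.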
Findings
Inclusion of space explains deviations from Zipf's law.
Mandelbrot's fudge factor accounts for exclusion of non-word objects.
Supports the idea that space is a word, too.
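A minimal sketch of what counting space and punctuation as tokens could look like. The segmentation rule (`TOKEN_RE`), the function name `rank_frequency`, and the sample sentence are all hypothetical illustrations; the paper's own segmentation procedure is not described on this page.

```python
import re
from collections import Counter

# Hypothetical segmentation rule (an assumption, not the paper's method):
# a run of letters, digits, or apostrophes is a word; every whitespace
# character and every other character is a token of its own.
WORD_RE = re.compile(r"[A-Za-z0-9']+")
TOKEN_RE = re.compile(r"[A-Za-z0-9']+|\s|[^A-Za-z0-9'\s]")


def rank_frequency(text, keep_non_words=True):
    """Return (token, count) pairs sorted by descending count."""
    tokens = TOKEN_RE.findall(text)
    if not keep_non_words:
        # Drop space and punctuation, leaving only word tokens.
        tokens = [t for t in tokens if WORD_RE.fullmatch(t)]
    return Counter(tokens).most_common()


sample = "the cat sat on the mat, and the dog sat too."
with_space = rank_frequency(sample, keep_non_words=True)
words_only = rank_frequency(sample, keep_non_words=False)

print(with_space[:3])
print(words_only[:3])
```

With non-word tokens kept, the space character takes rank 1 and every word moves down at least one rank. A shift of this kind is what an additive term on the rank would absorb when those tokens are left out.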
Abstract
For words, rank-frequency distributions have long been heralded for adherence to a potentially universal phenomenon known as Zipf's law. The hypothetical form of this empirical phenomenon was refined by Benoît Mandelbrot to that which is presently referred to as the Zipf-Mandelbrot law. Parallel to this, Herbert Simon proposed a selection model potentially explaining Zipf's law. However, a significant dispute between Simon and Mandelbrot, notable empirical exceptions, and the lack of a strong empirical connection between Simon's model and the Zipf-Mandelbrot law have left the questions of universality and mechanistic generation open. We offer a resolution to these issues by exhibiting how the dark matter of word segmentation, i.e., space, punctuation, etc., connects the Zipf-Mandelbrot law to Simon's mechanistic process. This explains Mandelbrot's refinement as no more than a fudge…
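The selection model mentioned in the abstract can be sketched in its textbook form: at each step a brand-new type is introduced with probability alpha, and otherwise a token already in the stream is copied, so a type is picked in proportion to its current frequency. The function name `simulate_simon` and the parameter values below are assumptions chosen for illustration, not the configuration studied in the paper.

```python
import random
from collections import Counter


def simulate_simon(n_tokens, alpha, seed=0):
    """Generate a token stream with the textbook form of Simon's model.

    alpha is the probability of introducing a new type at each step
    (an illustrative parameter name, not taken from the paper).
    """
    rng = random.Random(seed)
    stream = [0]  # the first token is necessarily a new type
    next_id = 1
    for _ in range(n_tokens - 1):
        if rng.random() < alpha:
            # Innovation: introduce a type that has not been seen before.
            stream.append(next_id)
            next_id += 1
        else:
            # Copying: a uniform draw over past tokens selects each
            # type in proportion to its current frequency.
            stream.append(rng.choice(stream))
    return stream


stream = simulate_simon(n_tokens=20000, alpha=0.1)
# Frequencies sorted into rank order, most frequent first.
counts = sorted(Counter(stream).values(), reverse=True)
print(len(counts), counts[:5])
```

In this basic form the model is known to produce an asymptotic rank-frequency exponent of 1 - alpha, so small innovation rates give nearly Zipfian streams. A finite run such as this one will only approximate that exponent.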
Taxonomy
Topics: Fractal and DNA sequence analysis
