Heaps' law, statistics of shared components and temporal patterns from a sample-space-reducing process
Andrea Mazzolini, Alberto Colliva, Michele Caselle, Matteo Osella

TL;DR
This paper demonstrates that sample-space-reducing (SSR) processes can jointly produce Zipf's law, Heaps' law, and the statistics of shared components, offering a simple yet comprehensive model for component systems such as language.
Contribution
It shows, analytically and through simulations, that an SSR mechanism can generate several of the statistical patterns observed in component systems, and it compares SSR with alternative models such as preferential attachment.
Findings
SSR models produce Zipf's law and Heaps' law simultaneously.
The temporal distribution produced by SSR differs from that of rich-get-richer models.
Empirical data suggest that SSR better models language component statistics.
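For reference, the two laws named in the findings are usually written in the following textbook form. The symbols r, n, \alpha and \beta are notation assumed here for illustration; this summary does not state the exponents obtained in the paper.

```latex
% Zipf's law: frequency f of the component of rank r
f(r) \propto r^{-\alpha}

% Heaps' law: number V of distinct components in a system of size n
V(n) \propto n^{\beta}, \qquad 0 < \beta < 1
```

The condition \beta < 1 is what "sublinear growth of the component vocabulary" means.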
Abstract
Zipf's law is a hallmark of several complex systems with a modular structure, such as books composed of words or genomes composed of genes. In these component systems, Zipf's law describes the empirical power-law distribution of component frequencies. Stochastic processes based on a sample-space-reducing (SSR) mechanism, in which the number of accessible states reduces as the system evolves, have recently been proposed as a simple explanation for the ubiquitous emergence of this law. However, many complex component systems are characterized by other statistical patterns beyond Zipf's law, such as a sublinear growth of the component vocabulary with the system size, known as Heaps' law, and specific statistics of shared components. This work shows, with analytical calculations and simulations, that these statistical properties can emerge jointly from an SSR mechanism, thus making it an…
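To make the mechanism described in the abstract concrete, here is a minimal Python sketch of the basic SSR process as it is commonly formulated: a walker starts at the highest of N states and repeatedly jumps to a uniformly chosen strictly lower state, restarting once it reaches state 1. This is an illustration under assumptions, not necessarily the exact model analyzed in the paper; the function names `ssr_run` and `simulate` and all parameter values are hypothetical. In this basic version a state i below the starting state is visited with probability 1/i per cascade, which gives Zipf-like visit counts.

```python
import random
from collections import Counter


def ssr_run(n_states, rng):
    """One SSR cascade: start at n_states and jump to a uniformly chosen
    strictly lower state until state 1 is reached. Returns visited states."""
    state = n_states
    visited = []
    while state > 1:
        # The accessible sample space shrinks to {1, ..., state - 1}.
        state = rng.randint(1, state - 1)  # both endpoints inclusive
        visited.append(state)
    return visited


def simulate(n_states=1000, n_runs=20000, seed=0):
    """Repeat the cascade n_runs times (assumed restart rule).
    Returns visit counts per state and the number of distinct states
    seen after each step (a vocabulary-growth curve)."""
    rng = random.Random(seed)
    counts = Counter()
    seen = set()
    vocabulary_growth = []
    for _ in range(n_runs):
        for s in ssr_run(n_states, rng):
            counts[s] += 1
            seen.add(s)
            vocabulary_growth.append(len(seen))
    return counts, vocabulary_growth


if __name__ == "__main__":
    counts, growth = simulate()
    n_runs = 20000
    # Compare empirical visit frequency per cascade with the 1/i prediction.
    for i in (1, 2, 5, 10, 100):
        print(i, counts[i] / n_runs, 1 / i)
    print("distinct states seen:", growth[-1])
```

Plotting `counts` against state index on log-log axes, and `vocabulary_growth` against step number, is one way to inspect the Zipf-like and sublinear-growth behavior by eye.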
