
TL;DR
This paper presents a mathematical framework for analyzing how recursive reuse and selective filtering shape the evolution of public text corpora and, in turn, the richness and diversity of language models.
Contribution
It introduces an exactly solvable model for recursive text ecosystems, distinguishing effects of drift and selection on corpus structure and quality.
Findings
Drift causes rare forms to diminish in the corpus.
Selective filtering can sustain richer linguistic structures.
Optimal bounds are established for divergence from shallow equilibria.
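The drift finding can be illustrated with a toy simulation. The sketch below is not the paper's variable-order n-gram model: it assumes a much simpler unigram setting in which each generation's corpus is resampled, with replacement, from the empirical token frequencies of the previous one. The function names, the Zipf-like starting weights, and all parameter values are hypothetical choices made for illustration only.

```python
import random


def resample_generation(corpus, rng):
    """Draw a same-sized corpus from the empirical token frequencies of `corpus`."""
    # Sampling uniformly from the token list is sampling from its empirical distribution.
    return rng.choices(corpus, k=len(corpus))


def simulate_drift(vocab_size=200, corpus_size=1000, generations=50, seed=0):
    """Return the number of distinct token types present at each generation."""
    rng = random.Random(seed)
    # Assumed Zipf-like starting weights: token i has weight 1 / (i + 1).
    weights = [1.0 / (i + 1) for i in range(vocab_size)]
    corpus = rng.choices(range(vocab_size), weights=weights, k=corpus_size)

    support = [len(set(corpus))]
    for _ in range(generations):
        # Unfiltered reuse: the next corpus is learned only from the current one.
        corpus = resample_generation(corpus, rng)
        support.append(len(set(corpus)))
    return support


support_sizes = simulate_drift()
print("distinct types, first generation:", support_sizes[0])
print("distinct types, last generation: ", support_sizes[-1])
```

In this toy setting a token type that fails to be drawn in one generation has zero empirical frequency afterwards and cannot return, so the count of distinct types never increases. Rare types are the ones most likely to be missed, which is the qualitative effect described as drift. Selection is not modelled here.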
Abstract
The public text record -- the material from which both people and AI systems now learn -- is increasingly shaped by its own outputs. Generated text enters the public record, later agents learn from it, and the cycle repeats. Here we develop an exactly solvable mathematical framework for this recursive process, based on variable-order n-gram agents, and separate two forces acting on the public corpus. The first is drift: unfiltered reuse progressively removes rare forms, and in the infinite-corpus limit we characterise the stable distributions exactly. The second is selection: publication, ranking and verification filter what enters the record, and the outcome depends on what is selected. When publication merely reflects the statistical status quo, the corpus converges to a shallow state in which further lookahead brings no benefit. When publication is normative -- rewarding quality,…
