Drift and selection in LLM text ecosystems

Søren Riis

arXiv:2604.08554·cs.CL·April 13, 2026

TL;DR

This paper presents a mathematical framework for analyzing how recursive reuse and selective filtering shape the evolution of public text corpora, which in turn affects the richness and diversity of language models.

Contribution

It introduces an exactly solvable model of recursive text ecosystems and separates the effects of drift and selection on corpus structure and quality.

Findings

01 Drift causes rare forms to diminish in the corpus.

02 Selective filtering can sustain richer linguistic structures.

03 Optimal bounds are established for divergence from shallow equilibria.

Abstract

The public text record -- the material from which both people and AI systems now learn -- is increasingly shaped by its own outputs. Generated text enters the public record, later agents learn from it, and the cycle repeats. Here we develop an exactly solvable mathematical framework for this recursive process, based on variable-order $n$-gram agents, and separate two forces acting on the public corpus. The first is drift: unfiltered reuse progressively removes rare forms, and in the infinite-corpus limit we characterise the stable distributions exactly. The second is selection: publication, ranking and verification filter what enters the record, and the outcome depends on what is selected. When publication merely reflects the statistical status quo, the corpus converges to a shallow state in which further lookahead brings no benefit. When publication is normative -- rewarding quality,…
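The drift mechanism described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's model: it assumes a finite corpus and a unigram (order-1) agent rather than variable-order $n$-gram agents, and the function names and starting corpus are hypothetical. Each generation, the agent estimates token frequencies from the current corpus and regenerates a corpus of the same size with no filtering. A form that fails to be sampled once can never return, so the number of distinct forms can only shrink.

```python
import random
from collections import Counter


def resample_generation(corpus, rng):
    """One unfiltered reuse step: estimate unigram frequencies from the
    corpus, then generate a new corpus of equal size from them."""
    counts = Counter(corpus)
    types = list(counts)
    weights = [counts[t] for t in types]
    return rng.choices(types, weights=weights, k=len(corpus))


def simulate_drift(corpus, generations, seed=0):
    """Return the number of distinct forms after each generation."""
    rng = random.Random(seed)
    support_sizes = [len(set(corpus))]
    for _ in range(generations):
        corpus = resample_generation(corpus, rng)
        support_sizes.append(len(set(corpus)))
    return support_sizes


# Hypothetical starting corpus: one common form plus 100 rare forms.
initial = ["the"] * 900 + [f"rare_{i}" for i in range(100)]
sizes = simulate_drift(initial, generations=50)
print("distinct forms at start:", sizes[0])
print("distinct forms at end:  ", sizes[-1])
```

The printed counts depend on the seed, but the sequence of support sizes is non-increasing by construction, which is the finite-sample analogue of rare forms being progressively removed.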
