Language Generation with Replay: A Learning-Theoretic View of Model Collapse
Giorgio Racca, Michal Valko, Amartya Sanyal

TL;DR
This paper gives a theoretical analysis of how replaying generated content into training data can lead to model collapse in large language models, identifying conditions under which replay does and does not limit what a model can generate.
Contribution
It introduces a learning-theoretic framework with a replay adversary to analyze the impact of reusing generated data on language model training, highlighting when replay causes fundamental limitations.
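To make the replay idea concrete, here is a minimal toy sketch in Python. It is not the paper's construction. The language collection (language i is the positive multiples of i + 1), the deliberately naive generator, and the names `in_language`, `naive_generate`, and `run` are all hypothetical, and modelling replay as "every output is fed straight back into the example stream" is an assumption made for illustration only.

```python
from itertools import count

NUM_LANGUAGES = 10  # hypothetical toy collection size


def in_language(i, x):
    # Toy language i: the positive multiples of (i + 1).
    return x > 0 and x % (i + 1) == 0


def naive_generate(seen):
    # Pick the highest-index language consistent with everything seen so far,
    # then emit its smallest element that has not appeared yet.
    for i in reversed(range(NUM_LANGUAGES)):
        if all(in_language(i, x) for x in seen):
            return next(k * (i + 1) for k in count(1) if k * (i + 1) not in seen)
    raise ValueError("unreachable: language 0 contains every positive integer")


def run(true_examples, replay):
    # Feed true examples one at a time and record the generator's outputs.
    seen, outputs = set(), []
    for x in true_examples:
        seen.add(x)
        out = naive_generate(seen)
        outputs.append(out)
        if replay:
            # Assumption: the adversary replays each output into the stream.
            seen.add(out)
    return outputs


# Target language: the even numbers (language index 1).
clean = run([18, 2, 4, 6], replay=False)
replayed = run([18, 2, 4, 6], replay=True)
```

With these inputs, `clean` is `[9, 4, 6, 8]`: the generator makes one early mistake (9 is odd) and then recovers. `replayed` is `[9, 1, 3, 5]`: once the mistaken 9 re-enters the stream, the only consistent hypothesis left is "all positive integers", and this particular naive generator never returns to the even numbers. The sketch only illustrates how replay can contaminate the evidence a generator sees; it says nothing about which notions of generation are affected in general.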
Findings
Replay can fundamentally limit generation under weaker notions of generation, but not under uniform generation.
Theoretical results align with practical heuristics like data cleaning and watermarking.
Identifies scenarios where output filtering fails to prevent model collapse.
Abstract
As scaling laws push the training of frontier large language models (LLMs) toward ever-growing data requirements, training pipelines are approaching a regime where much of the publicly available online text may be consumed. At the same time, widespread LLM usage increases the volume of machine-generated content on the web; together, these trends raise the likelihood of generated text re-entering future training corpora, increasing the associated risk of performance degradation often called model collapse. In practice, model developers address this concern through data cleaning, watermarking, synthetic-data policies, or, in some cases, blissful ignorance. However, the problem of model collapse in generative models has not been examined from a learning-theoretic perspective: we study it through the theoretical lens of the language generation in the limit framework, introducing a replay…
Taxonomy
Topics: Topic Modeling · Machine Learning and Algorithms · Machine Learning and Data Classification
