The Curse of Recursion: Training on Generated Data Makes Models Forget
Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson

TL;DR
This paper investigates how training models on generated data leads to irreversible degradation of original content diversity, a phenomenon termed Model Collapse, affecting various generative models including LLMs.
Contribution
It introduces the concept of Model Collapse, demonstrating its occurrence across multiple generative models and providing theoretical insights into this widespread issue.
Findings
Model Collapse causes loss of original content distribution tails.
The phenomenon is observed in VAEs, GMMs, and LLMs.
Training on generated data can irreversibly degrade model quality.
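The tail-loss finding can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup. It assumes the simplest possible "model", a one-dimensional Gaussian fitted by sample mean and sample standard deviation, and refits it each generation using only samples drawn from the previous generation's fit. The function name `simulate_recursive_fitting` and its parameters are hypothetical, chosen for illustration.

```python
import random
import statistics


def simulate_recursive_fitting(n_samples=10, n_generations=500, seed=0):
    """Repeatedly fit a 1-D Gaussian to samples drawn from the previous fit.

    Returns a list of (mean, std) pairs, one per generation, starting with
    the original distribution N(0, 1) at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: stands in for the "real" data
    history = [(mu, sigma)]
    for _ in range(n_generations):
        # Generation k sees only data produced by generation k-1's model.
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)  # sample standard deviation
        history.append((mu, sigma))
    return history


if __name__ == "__main__":
    history = simulate_recursive_fitting()
    for generation in (0, 100, 250, 500):
        mean, std = history[generation]
        print(f"generation {generation}: mean={mean:.4f}, std={std:.3e}")
```

Under these assumptions, finite-sample estimation error compounds from one generation to the next: the fitted standard deviation tends to shrink towards zero while the mean drifts away from its starting value, so later generations place almost no probability mass on the tails of the original distribution. Larger values of `n_samples` slow the effect.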
Abstract
Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its…
Videos
Are ChatBots their own death? | Training on Generated Data Makes Models Forget – Paper explained (YouTube)
"You don't fine-tune your way to AGI" - Here's why. [Eiso Kant] (YouTube)
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational Physics and Python Applications
Methods: Cosine Annealing · Layer Normalization · Byte Pair Encoding · Label Smoothing · Dropout · Residual Connection · Linear Warmup With Cosine Annealing · Linear Layer · Absolute Position Encodings
