The Curse of Recursion: Training on Generated Data Makes Models Forget

Ilia Shumailov; Zakhar Shumaylov; Yiren Zhao; Yarin Gal; Nicolas Papernot; Ross Anderson

arXiv:2305.17493·cs.LG·April 16, 2024·153 cites

TL;DR

This paper investigates how training on model-generated data irreversibly degrades the resulting models: the tails of the original content distribution disappear. The authors term this phenomenon Model Collapse and show that it affects a range of generative models, including LLMs.

Contribution

It introduces the concept of Model Collapse, demonstrating its occurrence across multiple generative models and providing theoretical insights into this widespread issue.

Findings

01

Model Collapse causes loss of original content distribution tails.

02

The phenomenon is observed in VAEs, GMMs, and LLMs.

03

Training on generated data can irreversibly degrade model quality.
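The tail-loss effect in the findings above can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup: it repeatedly fits a single Gaussian by maximum likelihood to samples drawn from the previous generation's fit. The function name `simulate_recursive_fitting`, the sample size, the number of generations, and the seed are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_recursive_fitting(n_samples=20, n_generations=200, seed=0):
    """Toy illustration (hypothetical helper, not from the paper).

    Generation 0 is fitted to "real" data drawn from N(0, 1). Every later
    generation is fitted only to samples produced by the previous fit.
    Returns the fitted standard deviation of each generation.
    """
    rng = np.random.default_rng(seed)
    data = rng.normal(loc=0.0, scale=1.0, size=n_samples)  # "real" data

    stds = []
    for _ in range(n_generations):
        # Maximum-likelihood fit of a Gaussian (np.std defaults to ddof=0).
        mu, sigma = data.mean(), data.std()
        stds.append(float(sigma))
        # The next generation sees only model-generated samples.
        data = rng.normal(loc=mu, scale=sigma, size=n_samples)
    return stds


stds = simulate_recursive_fitting()
print(f"generation 0 std:   {stds[0]:.3f}")
print(f"generation {len(stds) - 1} std: {stds[-1]:.3e}")
```

With a small sample per generation, estimation error compounds and the fitted standard deviation tends to drift toward zero, so later generations place almost no mass on what were originally tail values. Larger `n_samples` slows the drift in this toy setting but does not remove it.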

Abstract

Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its…

Code & Models

Repositories

syusuke9999/ModelCollapseSimulation

Models

svb01/fine-tuned-embedding-model
model · 2 dl · ♡ 1

Videos

Are ChatBots their own death? | Training on Generated Data Makes Models Forget – Paper explained · YouTube

"You don't fine-tune your way to AGI" - Here's why. [Eiso Kant] · YouTube

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational Physics and Python Applications

Methods: Cosine Annealing · Layer Normalization · Byte Pair Encoding · Label Smoothing · Dropout · Residual Connection · Linear Warmup With Cosine Annealing · Linear Layer · Absolute Position Encodings