Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World
Joshua Kazdan, Rylan Schaeffer, Apratim Dey, Matthias Gerstgrasser, Rafael Rafailov, David L. Donoho, Sanmi Koyejo

TL;DR
This paper investigates how synthetic data affects generative model training. It shows that models avoid collapse when synthetic data is accumulated alongside real data, and that they degrade only gradually when each training generation is constrained to a fixed-size data subset.
Contribution
The study systematically compares different training workflows involving synthetic data, confirming conditions under which generative models remain stable or collapse.
Findings
Replacing real data with synthetic data causes collapse.
Accumulating synthetic with real data maintains stability.
Using fixed-size synthetic data subsets leads to gradual degradation.
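The three workflows above can be illustrated with a toy simulation. The sketch below is not the authors' code: it assumes a univariate Gaussian (a simplification of the multivariate setting named in the abstract), and the names `fit`, `simulate`, and the workflow labels are hypothetical. The `accumulate_subsample` branch encodes one reading of the fixed-size-subset workflow, namely fitting on a fixed-size random subset of the accumulated pool.

```python
import math
import random


def fit(samples):
    """Fit a univariate Gaussian: sample mean and unbiased sample std."""
    n = len(samples)
    mean = math.fsum(samples) / n
    var = math.fsum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, math.sqrt(var)


def simulate(workflow, n=100, generations=50, seed=0):
    """Return the fitted std at each generation under one workflow.

    workflow is "replace", "accumulate", or "accumulate_subsample"
    (hypothetical labels chosen for this sketch).
    """
    rng = random.Random(seed)
    real = [rng.gauss(0.0, 1.0) for _ in range(n)]  # real data ~ N(0, 1)
    pool = list(real)   # everything available so far
    train = list(real)  # what the next model is fit on
    stds = []
    for _ in range(generations):
        mean, std = fit(train)
        stds.append(std)
        # The fitted model generates n synthetic points.
        synthetic = [rng.gauss(mean, std) for _ in range(n)]
        if workflow == "replace":
            # Discard all earlier data; keep only the newest synthetic set.
            pool = synthetic
            train = pool
        elif workflow == "accumulate":
            # Keep real data and every synthetic generation.
            pool = pool + synthetic
            train = pool
        elif workflow == "accumulate_subsample":
            # Accumulate, but train on a fixed-size random subset (assumed reading).
            pool = pool + synthetic
            train = rng.sample(pool, n)
        else:
            raise ValueError(f"unknown workflow: {workflow!r}")
    return stds
```

Calling `simulate("replace", n=10, generations=500)` and `simulate("accumulate", n=10, generations=500)` and comparing the final entries gives a rough picture of how the fitted spread evolves under each workflow. Exact values depend on the seed and sample size.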
Abstract
What happens when generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models? Some prior work warns of "model collapse" as the web is overwhelmed by synthetic data; other work suggests the problem can be contained (i.e. collapse can be avoided) by managing how available data are used in pretraining. In this paper, we report experiments on three ways of using data (training-workflows), across three generative model task-settings (multivariate Gaussian estimation, kernel density estimation, and language-model fine-tuning) to further confirm the possibility of containment: (a) we confirm that the training-workflow of replacing all real data by successive generations of purely synthetic data indeed suffers model collapse in all task-settings studied; (b) we consider the training-workflow of accumulating synthetic data…
Taxonomy
Topics: Semantic Web and Ontologies
