A Note on Shumailov et al. (2024): "AI Models Collapse When Trained on Recursively Generated Data"
Ali Borji

TL;DR
This paper examines why training generative models on their own synthetic data leads to collapse. Its theoretical analysis indicates that the collapse is a statistical phenomenon and may be unavoidable.
Contribution
It provides a theoretical explanation for model collapse under recursive training on synthetic data, by analyzing what happens when a distribution (via Kernel Density Estimation) or a model is fitted to data and then repeatedly sampled.
Findings
Model collapse is a statistical phenomenon.
Repeated sampling from fitted distributions can lead to collapse.
Collapse may be an unavoidable consequence of recursive training.
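A standard textbook illustration of why such collapse can be purely statistical is the one-dimensional Gaussian case. This is a sketch under stated assumptions and is not claimed to be the derivation used in the paper: each generation fits a Gaussian by maximum likelihood to n i.i.d. samples drawn from the previous generation's fit, and the symbols n, t, and sigma_t are introduced here only for illustration.

```latex
% Assumption: generation t+1 is the maximum-likelihood Gaussian fit to
% n i.i.d. samples x_1, ..., x_n drawn from N(\mu_t, \sigma_t^2).
\[
\sigma_{t+1}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
\mathbb{E}\!\left[\sigma_{t+1}^2 \mid \sigma_t^2\right] = \frac{n-1}{n}\,\sigma_t^2
\]
\[
\Longrightarrow\quad
\mathbb{E}\!\left[\sigma_t^2\right] = \left(\frac{n-1}{n}\right)^{t} \sigma_0^2
\;\xrightarrow[t \to \infty]{}\; 0 .
\]
```

Under these assumptions the expected fitted variance shrinks by a factor of (n-1)/n per generation, so the spread of the fitted distribution decays even though no single step does anything other than ordinary estimation and sampling.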
Abstract
The study conducted by Shumailov et al. (2024) demonstrates that repeatedly training a generative model on synthetic data leads to model collapse. This finding has generated considerable interest and debate, particularly given that current models have nearly exhausted the available data. In this work, we investigate the effects of fitting a distribution (through Kernel Density Estimation, or KDE) or a model to the data, followed by repeated sampling from it. Our objective is to develop a theoretical understanding of the phenomenon observed by Shumailov et al. (2024). Our results indicate that the outcomes reported are a statistical phenomenon and may be unavoidable.
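The fit-then-resample loop described in the abstract can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the paper's code: the function names, the fixed-bandwidth Gaussian kernel, and the sample sizes are all assumptions made here. Sampling from a Gaussian-kernel KDE is done by picking a stored point uniformly at random and adding Gaussian noise; with a bandwidth of zero this reduces to resampling the data with replacement.

```python
import random
import statistics


def sample_from_kde(data, n, bandwidth, rng):
    """Draw n points from a Gaussian-kernel KDE fitted to `data`.

    Hypothetical helper: picks a stored point uniformly at random and adds
    Gaussian noise with standard deviation `bandwidth`. With bandwidth == 0
    this is plain resampling with replacement.
    """
    samples = []
    for _ in range(n):
        center = rng.choice(data)
        noise = rng.gauss(0.0, bandwidth) if bandwidth > 0 else 0.0
        samples.append(center + noise)
    return samples


def recursive_generations(data, generations, bandwidth, seed=0):
    """Repeatedly fit a KDE to the latest generation and resample from it.

    Returns a list of generations; element 0 is the original data.
    """
    rng = random.Random(seed)
    history = [list(data)]
    for _ in range(generations):
        current = history[-1]
        history.append(sample_from_kde(current, len(current), bandwidth, rng))
    return history


def summarize(history):
    """Per generation: (number of distinct values, population std. dev.)."""
    return [(len(set(g)), statistics.pstdev(g)) for g in history]


if __name__ == "__main__":
    rng = random.Random(42)
    real = [rng.gauss(0.0, 1.0) for _ in range(200)]  # stand-in for real data
    for h in (0.0, 0.2):  # assumed bandwidths, chosen only for illustration
        stats = summarize(recursive_generations(real, 50, h, seed=1))
        for t in (0, 10, 50):
            distinct, spread = stats[t]
            print(f"h={h} gen={t} distinct={distinct} std={spread:.3f}")
```

In the zero-bandwidth case every generation is a subset of the previous one, so the number of distinct values can never increase, which gives a simple view of how information about the original data is lost. With a positive bandwidth the samples are no longer exact copies, and how the distribution drifts depends on the bandwidth and sample size chosen.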
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification
