A Note on Shumailov et al. (2024): 'AI Models Collapse When Trained on Recursively Generated Data'

Ali Borji

arXiv:2410.12954·cs.LG·October 28, 2024



TL;DR

This paper examines why repeatedly training generative models on their own synthetic data leads to collapse. Its theoretical analysis indicates that the collapse is a statistical phenomenon and may be unavoidable.

Contribution

The paper offers a theoretical explanation of model collapse under recursive training on synthetic data, based on analyzing what happens when a distribution or model is fitted to data and then repeatedly sampled.

Findings

01. Model collapse is a statistical phenomenon.

02. Repeated sampling from fitted distributions can lead to collapse.

03. Collapse may be an unavoidable consequence of recursive training.

Abstract

The study conducted by Shumailov et al. (2024) demonstrates that repeatedly training a generative model on synthetic data leads to model collapse. This finding has generated considerable interest and debate, particularly given that current models have nearly exhausted the available data. In this work, we investigate the effects of fitting a distribution (through Kernel Density Estimation, or KDE) or a model to the data, followed by repeated sampling from it. Our objective is to develop a theoretical understanding of the phenomenon observed by Shumailov et al. (2024). Our results indicate that the outcomes reported are a statistical phenomenon and may be unavoidable.
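
The fit-then-resample loop described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's code: it assumes a one-dimensional Gaussian fitted by maximum likelihood as a simple stand-in for the KDE or model fit, and the function name `recursive_gaussian_fit`, the sample size, the number of generations, and the seed are all hypothetical choices made for illustration. Each generation fits a mean and standard deviation to the current data, then replaces the data with fresh samples drawn from that fit.

```python
import numpy as np


def recursive_gaussian_fit(n_samples=20, n_generations=500, seed=0):
    """Repeatedly fit a Gaussian to data, then resample from the fit.

    Illustrative stand-in for the fit-and-resample loop; the Gaussian
    model and all default parameters are assumptions, not the paper's setup.
    Returns the fitted standard deviation at every generation.
    """
    rng = np.random.default_rng(seed)
    # Generation 0: "real" data drawn from a standard normal.
    data = rng.normal(0.0, 1.0, size=n_samples)
    stds = []
    for _ in range(n_generations + 1):
        # Maximum-likelihood fit (np.std defaults to ddof=0).
        mu, sigma = data.mean(), data.std()
        stds.append(sigma)
        # The next generation sees only synthetic samples from the fit.
        data = rng.normal(mu, sigma, size=n_samples)
    return np.array(stds)


stds = recursive_gaussian_fit()
print(f"fitted std at generation 0: {stds[0]:.3f}")
print(f"fitted std at final generation: {stds[-1]:.3e}")
```

With a small sample size, the fitted standard deviation in this toy setting tends to shrink across generations, because each maximum-likelihood variance estimate is slightly biased downward and sampling noise compounds. A KDE-based loop may behave differently in its details, since the kernel bandwidth adds spread at every step.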


Code & Models

Models

linhuixiao/Awesome-Visual-Grounding


Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification