Rate of Model Collapse in Recursive Training
Ananda Theertha Suresh, Andrew Thangaraj, Aditya Nanda Kishore Khandavally

TL;DR
This paper investigates how quickly models trained recursively on generated data tend to forget original data nuances, providing theoretical and experimental insights into the rate of model collapse for fundamental distributions.
Contribution
It offers the first theoretical characterization of the rate of model collapse in recursive training for discrete and Gaussian distributions under maximum likelihood estimation.
Findings
For discrete distributions, the time to forget a word grows approximately linearly with the number of times it occurred in the original corpus.
For Gaussian models, the standard deviation shrinks to zero after roughly n iterations, where n is the number of samples drawn at each iteration.
Model forgetting takes a long time in these simple distribution settings.
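The recursive setting behind these findings can be illustrated with a small simulation. This is a minimal sketch, not the paper's experimental code. It assumes that each round draws n fresh samples from the previous round's fitted model and refits by exact maximum likelihood. The function names, starting parameters, and the -1 "never forgotten" sentinel are illustrative choices.

```python
import numpy as np


def gaussian_recursion(n, rounds, rng):
    """Fitted standard deviation after each round of recursive ML training.

    Assumption: start from N(0, 1); each round samples n points from the
    current model and refits the mean and standard deviation by ML.
    """
    mu, sigma = 0.0, 1.0
    sigmas = []
    for _ in range(rounds):
        samples = rng.normal(mu, sigma, size=n)
        mu = samples.mean()
        sigma = samples.std()  # ddof=0 is the ML estimate of the std
        sigmas.append(sigma)
    return np.array(sigmas)


def discrete_recursion(counts, rounds, rng):
    """First round at which each symbol's count reaches zero.

    Assumption: the ML model is the empirical distribution of the current
    corpus, and each round resamples a corpus of the same total size.
    Returns -1 for symbols that survive all rounds.
    """
    counts = np.asarray(counts, dtype=np.int64)
    n = counts.sum()
    forgotten_at = np.full(counts.shape, -1)
    for t in range(1, rounds + 1):
        p = counts / n                   # ML estimate from current corpus
        counts = rng.multinomial(n, p)   # next generation's corpus
        newly = (counts == 0) & (forgotten_at == -1)
        forgotten_at[newly] = t          # a zero count can never recover
    return forgotten_at
```

Calling gaussian_recursion(n, rounds, np.random.default_rng(seed)) for several values of n, and discrete_recursion with a corpus containing both frequent and rare symbols, gives a rough way to compare simulated behavior against the rates stated above. Results from a single run are noisy, so averaging over many seeds is advisable.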
Abstract
Given the ease of creating synthetic data from machine learning models, new models can be potentially trained on synthetic data generated by previous models. This recursive training process raises concerns about the long-term impact on model quality. As models are recursively trained on generated data from previous rounds, their ability to capture the nuances of the original human-generated data may degrade. This is often referred to as model collapse. In this work, we ask how fast model collapse occurs for some well-studied distribution families under maximum likelihood (ML or near ML) estimation during recursive training. Surprisingly, even for fundamental distributions such as discrete and Gaussian distributions, the exact rate of model collapse is unknown. In this work, we theoretically characterize the rate of collapse in these fundamental settings and complement it with…
