A Probabilistic Perspective on Model Collapse

Shirong Xu; Hengzhi He; Guang Cheng

arXiv:2505.13947 · stat.ML · May 23, 2025

Open Access

TL;DR

This paper offers a probabilistic framework for understanding and preventing model collapse in recursive model training, focusing on the roles of sample size growth and estimation bias, and supports its results with simulations and real-data validation.

Contribution

The paper introduces a probabilistic perspective on model collapse, identifies conditions under which collapse can be prevented, and analyzes how estimation bias and sample size growth affect recursive training.

Findings

01 Progressively increasing the sample size at each training step is necessary to prevent collapse

02 Unbiased estimation requires superlinear sample growth

03 Training on synthetic data can outperform training on real data in some cases

Abstract

In recent years, model collapse has become a critical issue in language model training, making it essential to understand the underlying mechanisms driving this phenomenon. In this paper, we investigate recursive parametric model training from a probabilistic perspective, aiming to characterize the conditions under which model collapse occurs and, crucially, how it can be mitigated. We conceptualize the recursive training process as a random walk of the model estimate, highlighting how the sample size influences the step size and how the estimation procedure determines the direction and potential bias of the random walk. Under mild conditions, we rigorously show that progressively increasing the sample size at each training step is necessary to prevent model collapse. In particular, when the estimation is unbiased, the required growth rate follows a superlinear pattern. This rate needs…
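The random-walk picture in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's experiment: it assumes a one-dimensional Gaussian with known unit variance, refits the mean on synthetic samples drawn from the previous fit, and compares a constant sample size with a superlinear schedule. The function names, the starting size of 10, and the exponent 1.5 are illustrative assumptions, chosen only because the per-step variances 1/n_t then sum to a finite value.

```python
import math
import random
import statistics


def theoretical_drift_variance(schedule, steps):
    # In this toy model each refit adds independent N(0, 1/n_t) noise,
    # so Var(theta_T - theta_0) is the sum of 1/n_t over all steps.
    return sum(1.0 / schedule(t) for t in range(1, steps + 1))


def simulate_drift_variance(schedule, steps, trials=2000, seed=0):
    # Empirical variance of the final estimate across independent runs.
    rng = random.Random(seed)
    finals = []
    for _ in range(trials):
        theta = 0.0  # mean of the original (real) data distribution
        for t in range(1, steps + 1):
            n_t = schedule(t)
            # The mean of n_t draws from N(theta, 1) is exactly N(theta, 1/n_t),
            # so the refitted estimate is sampled directly.
            theta = rng.gauss(theta, 1.0 / math.sqrt(n_t))
        finals.append(theta)
    return statistics.pvariance(finals)


def constant_schedule(t):
    # Hypothetical baseline: the same sample size at every step.
    return 10


def superlinear_schedule(t):
    # Hypothetical superlinear growth; the exponent 1.5 is an assumption.
    return math.ceil(10 * t ** 1.5)


if __name__ == "__main__":
    for name, sched in [("constant", constant_schedule),
                        ("superlinear", superlinear_schedule)]:
        print(name,
              round(theoretical_drift_variance(sched, 100), 3),
              round(simulate_drift_variance(sched, 100), 3))
```

With a constant sample size, the variance of the estimate grows linearly in the number of steps, so the walk drifts without bound. With the superlinear schedule, the accumulated variance stays bounded because the series of 1/n_t converges. The exact growth rate the paper requires is not given in this excerpt, so the schedule here should be read as one example of superlinear growth rather than the paper's condition.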

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning