Theoretical Proof that Auto-regressive Language Models Collapse when Real-world Data is a Finite Set

Lecheng Wang; Xianjie Shi; Ge Li; Jia Li; Xuanming Zhang; Yihong Dong; Wenpin Jiao; Hong Mei

arXiv:2412.14872 · cs.CL · May 20, 2025

Open Access

TL;DR

This paper provides a theoretical proof that auto-regressive language models inevitably collapse when trained on a finite set of real-world data that becomes contaminated with generated data, emphasizing the importance of data quality.

Contribution

It offers a formal proof that LM collapse occurs once a finite corpus begins to incorporate generated data and no new real-world data is added, showing that limiting the quantity of synthetic data cannot prevent collapse.

Findings

01 LM collapse is inevitable with finite real data once synthetic data is introduced

02 Limiting synthetic data quantity does not prevent collapse

03 Data quality is crucial to avoid model collapse

Abstract

Auto-regressive language models (LMs) have been widely used to generate data in data-scarce domains to train new LMs, compensating for the scarcity of real-world data. Previous work experimentally found that LMs collapse when trained on recursively generated data. This paper presents a theoretical proof: once a corpus (such as a subset of the World Wide Web) begins to incorporate generated data and no new real-world data is added to the corpus, then no matter how small the amount of data each LM generates and contributes to the corpus, LM collapse is inevitable after sufficient time. This finding suggests that attempts to mitigate collapse by limiting the quantity of synthetic data in the corpus are fundamentally insufficient. Instead, avoiding collapse hinges on ensuring the quality of synthetic data.
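
The setting in the abstract (a fixed amount of real data, with each LM contributing some generated data) can be illustrated with simple arithmetic. The sketch below is not the paper's proof and does not by itself demonstrate collapse; it only shows that the share of real data in such a corpus falls below any chosen threshold after finitely many rounds, however small the per-round contribution is. The names `real_fraction`, `generations_until_below`, `n_real`, and `k_per_generation` are hypothetical and introduced here for illustration, as is the assumption that every round adds the same number of generated items.

```python
import math


def real_fraction(n_real: int, k_per_generation: int, generations: int) -> float:
    """Share of the corpus that is real data after `generations` rounds.

    Assumes (for illustration only) that the corpus starts with `n_real`
    real items and each round adds `k_per_generation` generated items.
    """
    return n_real / (n_real + k_per_generation * generations)


def generations_until_below(n_real: int, k_per_generation: int, threshold: float) -> int:
    """Smallest number of rounds after which the real share is below `threshold`."""
    if n_real <= 0 or k_per_generation <= 0 or not 0.0 < threshold < 1.0:
        raise ValueError("need n_real > 0, k_per_generation > 0, 0 < threshold < 1")
    # n / (n + k*t) < thr  is equivalent to  t > n * (1 - thr) / (thr * k)
    bound = n_real * (1.0 - threshold) / (threshold * k_per_generation)
    t = math.floor(bound) + 1
    # Correct for floating-point rounding in either direction.
    while real_fraction(n_real, k_per_generation, t) >= threshold:
        t += 1
    while t > 0 and real_fraction(n_real, k_per_generation, t - 1) < threshold:
        t -= 1
    return t


if __name__ == "__main__":
    # Shrinking the per-round contribution delays the dilution but does not stop it.
    for k in (100, 10, 1):
        rounds = generations_until_below(n_real=1000, k_per_generation=k, threshold=0.5)
        print(f"k={k}: real share drops below 50% after {rounds} rounds")
```

Reducing `k_per_generation` only increases the number of rounds needed, which is consistent with the abstract's point that limiting quantity alone is insufficient; whether and how dilution leads to collapse is what the paper's proof addresses.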

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques