Strong Model Collapse

Elvis Dohmatob; Yunzhen Feng; Arjun Subramonian; Julia Kempe

arXiv:2410.04840·cs.LG·October 10, 2024·3 cites

Open Access · 3 Reviews

TL;DR

This paper demonstrates that even minimal synthetic data can cause severe model collapse in large neural networks, and that increasing model size can both worsen and, beyond a threshold, mitigate this phenomenon.

Contribution

It establishes the existence of a strong form of model collapse due to synthetic data and analyzes how model size influences this effect through theoretical and empirical methods.

Findings

01

Small synthetic data fractions cause model collapse.

02

Larger models can amplify collapse in simplified regimes.

03

Beyond the interpolation threshold, larger models may reduce collapse.
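Findings 02 and 03 concern how model size interacts with collapse. The sketch below is a toy playground for that question, not the paper's experimental code. It fits a ridge regressor on top of a fixed random projection of width `m`. The function name `random_projection_risk`, the dimensions, the synthetic fraction `p_synth`, the label-shift size `c2`, and the ridge value are all illustrative assumptions. In this toy the fit can interpolate the training labels once `m` reaches the sample count `n`, which plays the role of the interpolation threshold.

```python
import numpy as np

def random_projection_risk(m, n=200, d=400, p_synth=0.1, c2=1.0,
                           noise=0.1, ridge=1e-3, seed=0):
    """Excess risk on the real distribution for a random-projection model of width m.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    w_real = rng.standard_normal(d) / np.sqrt(d)      # ground-truth weights
    delta = rng.standard_normal(d)
    delta *= np.sqrt(c2) / np.linalg.norm(delta)      # label shift with ||delta||^2 = c2

    n_synth = int(round(p_synth * n))
    X = rng.standard_normal((n, d))                   # same input marginal for all rows
    y = X @ w_real + noise * rng.standard_normal(n)
    y[:n_synth] += X[:n_synth] @ delta                # synthetic rows use w_real + delta

    S = rng.standard_normal((d, m)) / np.sqrt(d)      # fixed random projection (model size m)
    Z = X @ S                                         # projected features
    v = np.linalg.solve(Z.T @ Z + ridge * np.eye(m), Z.T @ y)

    # With isotropic inputs, the excess risk equals the squared
    # distance between the effective weights S @ v and w_real.
    return float(np.sum((S @ v - w_real) ** 2))

for m in (50, 100, 200, 300, 400):
    print(m, random_projection_risk(m))
```

Whether the printed risk rises or falls with `m` depends on the chosen `d`, `n`, and regularization, so the loop is meant for exploration rather than as a reproduction of the paper's curves.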

Abstract

Within the scaling laws paradigm, which underpins the training of large neural networks like ChatGPT and Llama, we consider a supervised regression setting and establish the existence of a strong form of the model collapse phenomenon, a critical performance degradation due to synthetic data in the training corpus. Our results show that even the smallest fraction of synthetic data (e.g., as little as 1% of the total training dataset) can still lead to model collapse: larger and larger training sets do not enhance performance. We further investigate whether increasing model size, an approach aligned with current trends in training large language models, exacerbates or mitigates model collapse. In a simplified regime where neural networks are approximated via random projections of tunable size, we both theoretically and empirically show that larger models can amplify model collapse.…
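The plateau described in the abstract can be illustrated with a minimal linear-regression sketch. This is a toy under stated assumptions, not the authors' setup. Synthetic data is modelled as a label shift: inputs share one distribution, real labels come from `w_real`, and synthetic labels come from `w_real + delta` with `||delta||^2 = c2`. The function name `excess_risk` and all default values are hypothetical. In this toy the estimator tends to `w_real + p_synth * delta` as `n` grows, so the excess risk approaches `p_synth**2 * c2` rather than zero.

```python
import numpy as np

def excess_risk(n, p_synth, d=20, c2=1.0, noise=0.1, ridge=1e-6, seed=0):
    """Excess risk on the real distribution after training on a real/synthetic mix.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    w_real = rng.standard_normal(d) / np.sqrt(d)      # ground-truth weights
    delta = rng.standard_normal(d)
    delta *= np.sqrt(c2) / np.linalg.norm(delta)      # label shift with ||delta||^2 = c2
    w_synth = w_real + delta                          # weights behind synthetic labels

    n_synth = int(round(p_synth * n))
    X = rng.standard_normal((n, d))                   # same input marginal for all rows
    y = np.empty(n)
    y[:n_synth] = X[:n_synth] @ w_synth               # synthetic rows
    y[n_synth:] = X[n_synth:] @ w_real                # real rows
    y += noise * rng.standard_normal(n)

    # Lightly regularized least squares on the mixed dataset.
    w_hat = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)

    # With isotropic inputs, the excess risk is the squared weight error.
    return float(np.sum((w_hat - w_real) ** 2))

for p in (0.0, 0.1):
    for n in (200, 2000, 20000):
        print(f"p_synth={p:.1f} n={n:6d} risk={excess_risk(n, p):.2e}")
```

Comparing the `p_synth=0.0` rows with the `p_synth=0.1` rows shows the contrast of interest: clean data keeps improving with `n`, while the mixed data settles near a floor set by the synthetic fraction and the label shift.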

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- The paper is very well written; the problem statement and the contributions are clearly established.
- The theoretical results seem sound and intriguing, and I appreciate the authors' efforts to present simplified examples to help the reader understand the implications better.
- The findings seem to transfer to practical learning settings fairly well.

Weaknesses

- Model Assumptions: I am unsure of the modelling of synthetic data and how well it translates into practice. Specifically, the authors model synthetic data using a label shift, assuming that the data (X) marginal remains the same. However, it seems unrealistic for autoregressive training (a key experiment in the paper), where the input tokens for next token generation come from the synthetic distribution.
- Experimental Details: The theoretical results establish a strong dependence on the qual

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- The paper shows model collapse in the theoretical settings of the linear and random projection models.
- The paper shows nice alignment between the experiments and the theoretical analysis (e.g., Fig. 3).

Weaknesses

I'm a bit unclear on the takeaway from the experimental results with GPT2. Specifically, the paper seems to make mixed claims for the case of large models beyond the interpolation threshold: it says both that large models may mitigate collapse beyond the interpolation threshold and that, in the experimental results, large models tend to amplify collapse beyond the interpolation threshold.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The problem tackled by the authors is of great interest for the training of large-scale generative models.
- I find the derivation of the proof elegant and the usage of the OVFPT novel in this setting.
- The derived results are insightful and well-analyzed by the authors, along with the empirical validation on synthetic data that helps to convey their implications to real-world scenarios.
- The authors conduct experiments on real data with (small) neural networks and LLMs that corroborate the

Weaknesses

1) Could the authors detail more the key steps/ideas of the proofs in the main paper (paragraph **Proving Theorem 1**)? I believe such tools could be beneficial to the reader, even in other applications.
2) Could the authors discuss the estimation of the quality of synthetic data (parameter $c^2$) on the real applications with MNIST and BabiStories? How would one assess it when training a large-scale model like an LLM or a VLM?
3) Could the authors discuss in more length the "cost" of approxima

Taxonomy

Topics: Complex Systems and Time Series Analysis · Mathematical Biology Tumor Growth · Global Energy and Sustainability Research