A Tale of Tails: Model Collapse as a Change of Scaling Laws

Elvis Dohmatob; Yunzhen Feng; Pu Yang; Francois Charton; Julia Kempe

arXiv:2402.07043 · cs.LG · June 3, 2024 · 38 cites

Open Access

TL;DR

This paper investigates how the inclusion of synthetic data in training corpora affects neural scaling laws, revealing potential model collapse phenomena and providing a theoretical framework validated by large-scale experiments.

Contribution

The paper introduces a theoretical framework for understanding model collapse caused by synthetic data and analyzes the resulting decay phenomena in scaling laws.

Findings

01. Identification of decay phenomena in scaling laws
02. Analysis of loss of scaling and skill un-learning
03. Validation with large-scale transformer experiments

Abstract

As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing capacity and the size of original (human or natural) training data. Yet, the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increased amounts of synthesized data. In this paper we ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus? Will future models still improve, or be doomed to degenerate up to total (model) collapse? We develop a theoretical framework of model collapse through the lens of scaling laws. We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with number of generations, the "un-learning" of skills, and grokking when mixing human and synthesized data. Our theory…

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications