Escaping Collapse: The Strength of Weak Data for Large Language Model Training

Kareem Amin; Sara Babakniya; Alex Bie; Weiwei Kong; Umar Syed; Sergei Vassilvitskii

arXiv:2502.08924·cs.LG·December 2, 2025

Open Access

TL;DR

This paper develops a boosting-inspired theoretical framework for determining how much data curation is needed for continual improvement when training large language models on synthetic data, and validates it with experiments.

Contribution

It introduces a formal analysis of data curation in synthetic data training for LLMs, connecting boosting techniques to recent methods and proposing dynamic focusing strategies.

Findings

01. Proper curation prevents performance collapse

02. Focusing on challenging examples improves results

03. Theoretical insights align with experimental validation

Abstract

Synthetically generated data plays an increasingly large role in training large language models. However, while synthetic data has been found to be useful, studies have also shown that without proper curation it can cause LLM performance to plateau, or even "collapse", after many training iterations. In this paper, we formalize this question and develop a theoretical framework to investigate how much curation is needed to ensure that LLM performance continually improves. Our analysis is inspired by boosting, a classic machine learning technique that leverages a very weak learning algorithm to produce an arbitrarily good classifier. The approach we analyze subsumes many recently proposed methods for training LLMs on synthetic data, and thus our analysis sheds light on why they are successful and also suggests opportunities for future improvement. We present experiments that…
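For readers unfamiliar with boosting, the following is a minimal sketch of classic AdaBoost with one-dimensional decision stumps. It illustrates only the general idea the abstract refers to (combining weak learners while reweighting toward examples that are still misclassified); it is not the paper's algorithm and says nothing about its LLM training setup. The toy dataset and all function names (`stump_predict`, `best_stump`, `adaboost`, `error_rate`) are assumptions made for illustration.

```python
import math

def stump_predict(x, threshold, sign):
    # Weak learner: predicts `sign` to the left of the threshold, `-sign` otherwise.
    return sign if x < threshold else -sign

def best_stump(xs, ys, weights):
    # Exhaustively pick the stump with the lowest weighted error.
    thresholds = sorted({x - 0.5 for x in xs} | {max(xs) + 0.5})
    best = None
    for t in thresholds:
        for s in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, s) != y)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, t, s = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, s))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, s))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    # Weighted vote of all stumps.
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if score >= 0 else -1

def error_rate(ensemble, xs, ys):
    wrong = sum(ensemble_predict(ensemble, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)

# Hypothetical toy data: no single stump separates these labels.
XS = list(range(10))
YS = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

if __name__ == "__main__":
    for rounds in (1, 3):
        print(rounds, error_rate(adaboost(XS, YS, rounds), XS, YS))
```

On this toy data, a single stump misclassifies 3 of the 10 points, while three boosting rounds fit all ten. The reweighting step is what concentrates later rounds on the examples earlier stumps got wrong.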

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques