Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification
Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, Julia Kempe

TL;DR
This paper investigates verification as a way to prevent model collapse when training large language models on synthesized data, combining a theoretical analysis with experiments on practical tasks.
Contribution
The paper introduces a theoretical framework, built on Gaussian mixtures with linear classifiers and linear verifiers, for using verifiers to select high-quality synthesized data, and shows that verification helps prevent model collapse in practical tasks.
Findings
Verifiers can effectively prevent model collapse with synthesized data.
The proposed proxy measure correlates strongly with model performance.
Experimental results show improved model stability using verification.
Abstract
Large Language Models (LLMs) are increasingly trained on data generated by other LLMs, either because generated text and images become part of the pre-training corpus, or because synthesized data is used as a replacement for expensive human annotation. This raises concerns about model collapse, a drop in a model's performance when its training set includes generated data. Considering that it is easier for both humans and machines to distinguish between good and bad examples than to generate high-quality samples, we investigate the use of verification on synthesized data to prevent model collapse. We provide a theoretical characterization using Gaussian mixtures, linear classifiers, and linear verifiers to derive conditions with measurable proxies to assess whether the verifier can effectively select synthesized data that leads to optimal performance. We experiment with two practical tasks…
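The theoretical setting named in the abstract (Gaussian mixtures, a linear classifier, a linear verifier) can be illustrated with a small simulation. The sketch below is not the paper's experimental setup: the function names, the label-flip noise model for the "generator", the class-mean classifier, and all parameter values are assumptions chosen only to show the idea of keeping synthesized labels that a verifier agrees with.

```python
import numpy as np

def sample_mixture(n, mu, rng):
    """Draw n points from a two-component Gaussian mixture:
    y in {-1, +1} uniformly, x ~ N(y * mu, I)."""
    y = rng.choice([-1, 1], size=n)
    x = y[:, None] * mu + rng.standard_normal((n, mu.size))
    return x, y

def fit_linear(x, y):
    """Class-mean linear classifier: w = mean(y * x)."""
    return (y[:, None] * x).mean(axis=0)

def accuracy(w, x, y):
    """Fraction of points whose predicted sign matches the label."""
    return float(np.mean(np.sign(x @ w) == y))

def run_demo(seed=0, d=20, n=2000, flip_prob=0.4):
    rng = np.random.default_rng(seed)
    mu = np.full(d, 2.0 / np.sqrt(d))  # assumed class mean, ||mu|| = 2

    # "Synthesized" data: true inputs, labels flipped with prob flip_prob
    # (an assumed stand-in for an imperfect generator).
    x, y_true = sample_mixture(n, mu, rng)
    flipped = rng.random(n) < flip_prob
    y_syn = np.where(flipped, -y_true, y_true)

    # Assumed imperfect linear verifier: the true direction plus noise.
    w_ver = mu + 0.5 * rng.standard_normal(d) / np.sqrt(d)

    # Verification step: keep samples whose synthesized label
    # agrees with the verifier's prediction.
    keep = np.sign(x @ w_ver) == y_syn

    # Train on all synthesized data vs. verifier-selected data.
    w_all = fit_linear(x, y_syn)
    w_sel = fit_linear(x[keep], y_syn[keep])

    # Evaluate on fresh, correctly labeled data.
    x_test, y_test = sample_mixture(5000, mu, rng)
    return {
        "kept_fraction": float(keep.mean()),
        "noise_all": float(flipped.mean()),
        "noise_kept": float(flipped[keep].mean()),
        "acc_all": accuracy(w_all, x_test, y_test),
        "acc_selected": accuracy(w_sel, x_test, y_test),
    }

if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}: {value:.3f}")
```

In this toy setup the verifier mainly lowers the label-noise rate of the retained data; how much that changes downstream accuracy depends on the noise model, the sample size, and the verifier quality, which is the kind of dependence the paper's conditions are meant to characterize.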
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Advanced Database Systems and Queries
Methods: Pruning
