How to Synthesize Text Data without Model Collapse?

Xuekai Zhu; Daixuan Cheng; Hengli Li; Kaiyan Zhang; Ermo Hua; Xingtai Lv; Ning Ding; Zhouhan Lin; Zilong Zheng; Bowen Zhou

arXiv:2412.14689·cs.CL·May 29, 2025

TL;DR

This paper investigates the impact of synthetic data on language model training, identifies issues like distributional shift and over-concentration, and proposes token editing of human data as a method to prevent model collapse and improve performance.

Contribution

The paper introduces token editing, a technique that modifies human-produced data at the token level to obtain semi-synthetic data, and argues theoretically that this prevents model collapse during training.

Findings

1. Higher synthetic data proportions degrade model performance.

2. Token editing constrains test error with a finite upper bound.

3. Token editing improves performance across various training scenarios.
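
The first finding concerns training on blends of human and synthetic text at different proportions. As a rough illustration of how such a blend could be assembled, here is a minimal sketch. The function name `mix_corpora`, its parameters, and the toy corpora are hypothetical and are not taken from the paper, which may construct its mixtures differently.

```python
import random
from typing import List, Sequence


def mix_corpora(
    human: Sequence[str],
    synthetic: Sequence[str],
    synthetic_fraction: float,
    size: int,
    seed: int = 0,
) -> List[str]:
    """Sample `size` documents, a given fraction of them synthetic.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")
    n_synthetic = round(size * synthetic_fraction)
    n_human = size - n_synthetic
    if n_synthetic > len(synthetic) or n_human > len(human):
        raise ValueError("not enough documents to build the requested mix")
    rng = random.Random(seed)
    # Sample without replacement from each source, then shuffle the blend.
    mixed = rng.sample(list(synthetic), n_synthetic) + rng.sample(list(human), n_human)
    rng.shuffle(mixed)
    return mixed


# Toy corpora: the "h"/"s" prefixes only mark where each document came from.
human_docs = [f"h{i}" for i in range(100)]
synthetic_docs = [f"s{i}" for i in range(100)]
mix = mix_corpora(human_docs, synthetic_docs, synthetic_fraction=0.25, size=40)
```

Sweeping `synthetic_fraction` from 0 to 1 and training one model per mix is one way to probe the relationship the finding describes.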

Abstract

Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-{n} models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how to synthesize data without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover a distributional shift phenomenon and an over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data…
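
The abstract is cut off before it describes the editing rule, so the sketch below is only one plausible reading of "token editing on human-produced data". It assumes that a prior language model assigns a probability to each human-written token and that tokens above a confidence threshold are resampled while the rest are kept. The function `token_edit`, the `threshold` value, the probabilities, and the `resample` callback are all hypothetical and are not the authors' implementation.

```python
import random
from typing import Callable, List, Sequence


def token_edit(
    tokens: Sequence[str],
    probs: Sequence[float],
    resample: Callable[[int], str],
    threshold: float = 0.99,
) -> List[str]:
    """Resample positions whose prior probability is >= threshold.

    Assumed rule for illustration; all other tokens stay as written.
    """
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    edited = []
    for i, (tok, p) in enumerate(zip(tokens, probs)):
        edited.append(resample(i) if p >= threshold else tok)
    return edited


# Toy input: the probabilities are made up, standing in for a prior model's scores.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
probs = [0.40, 0.10, 0.30, 0.995, 0.999, 0.20]
vocab = ["a", "this", "upon", "near"]
rng = random.Random(0)
edited = token_edit(tokens, probs, lambda i: rng.choice(vocab), threshold=0.99)
```

Because most positions fall below the threshold, the edited sequence stays largely human-written, which is the sense in which the result is semi-synthetic rather than fully generated.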

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Focus