What Matters in LLM-generated Data: Diversity and Its Effect on Model Fine-Tuning

Yuchang Zhu; Huazhen Zhong; Qunshu Lin; Haotong Wei; Xiaolong Sun; Zixuan Yu; Minghao Liu; Zibin Zheng; Liang Chen

arXiv:2506.19262 · cs.CL · June 26, 2025

TL;DR

This paper investigates how the diversity of LLM-generated data influences the performance of downstream models, revealing that moderate diversity can improve results while excessive diversity may harm performance.

Contribution

It provides an empirical analysis of how the diversity of LLM-generated data affects model fine-tuning, highlighting the diversity levels that yield the best performance.

Findings

01 Moderately diverse LLM-generated data improves downstream model performance.

02 Highly diverse generated data can negatively affect model accuracy.

03 Keeping distribution shift minimal is crucial for synthetic data to be beneficial.

Abstract

With the remarkable generative capabilities of large language models (LLMs), using LLM-generated data to train downstream models has emerged as a promising approach to mitigate data scarcity in specific domains and reduce time-consuming annotations. However, recent studies have highlighted a critical issue: iterative training on self-generated data results in model collapse, where model performance degrades over time. Despite extensive research on the implications of LLM-generated data, these works often neglect the importance of data diversity, a key factor in data quality. In this work, we aim to understand the implications of the diversity of LLM-generated data on downstream model performance. Specifically, we explore how varying levels of diversity in LLM-generated data affect downstream model performance. Additionally, we investigate the performance of models trained on data that…
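
To make "levels of diversity" concrete, here is a minimal sketch of one common lexical diversity score, distinct-n (the ratio of unique n-grams to total n-grams in a set of texts). The excerpt above does not say which diversity measure the authors use, so the choice of metric, the function name `distinct_n`, and the whitespace tokenization are all assumptions made for illustration.

```python
def distinct_n(texts, n=2):
    """Return unique n-grams / total n-grams over a list of texts.

    Illustrative assumption: this is NOT necessarily the diversity
    measure used in the paper. Tokenization is naive lowercase
    whitespace splitting.
    """
    unique = set()
    total = 0
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of size n over the token list.
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    # Guard against empty input or texts shorter than n tokens.
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    repetitive = ["the model is good", "the model is good"]
    varied = ["the model is good", "synthetic data can drift"]
    print(distinct_n(repetitive), distinct_n(varied))
```

A score near 1 means almost every n-gram is new, and a score near 0 means heavy repetition. Comparing the score across generated datasets is one rough way to rank them by diversity before fine-tuning, though it captures only surface-level variation.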
