Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning
Max Schaffelder, Albert Gatt

TL;DR
This study examines how the diversity of synthetic data sources affects fine-tuned large language models along three dimensions: distribution collapse, adversarial robustness, and self-preference bias, with implications for safety and usability.
Contribution
It provides empirical evidence on how the diversity of synthetic data sources shapes the behavior of fine-tuned models: preservation of the output distribution, loss of safeguards, and reduction of self-preference bias.
Findings
Diverse synthetic data mitigates distribution collapse.
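Both human and synthetic fine-tuning data can remove safeguards; synthetic data tends to yield higher-quality outputs, making them potentially more usable and more dangerous.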
Fine-tuning reduces self-preference bias, with human data being the most effective.
Abstract
As synthetic data becomes widely used in language model development, understanding its impact on model behavior is crucial. This paper investigates the impact of the diversity of sources of synthetic data on fine-tuned large language models. We focus on three key dimensions: distribution collapse, adversarial robustness, and self-preference bias. Our findings reveal that fine-tuning models on synthetic data from diverse sources can mitigate distribution collapse, preserving the breadth of the output distribution and the diversity of the output text. Furthermore, while both human and synthetic fine-tuning data can remove safeguards, we observe a tendency for higher output quality in the latter case, thus making outputs potentially more usable and dangerous. Finally, we also find evidence that fine-tuning reduces self-preference bias, with human data being the most effective, followed by…
