How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
Joel Niklaus, Atsuki Yamaguchi, Michal Štefánik, Guilherme Penedo, Hynek Kydlíček, Elie Bakouch, Lewis Tunstall, Edward Emanuel Beeching, Thibaud Frere, Colin Raffel, Leandro von Werra, Thomas Wolf

TL;DR
This study systematically evaluates how prompt design, generator models, and source data affect synthetic pretraining data quality, leading to the creation of a high-quality dataset that outperforms existing options.
Contribution
It provides comprehensive experimental insights into synthetic data generation, introduces a new dataset, and offers practical tools for the research community.
Findings
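Structured output formats (tables, math problems, FAQs, tutorials) outperform both curated web baselines and prior synthetic methods.
Scaling the generator model beyond 1B parameters provides no additional benefit.
The choice of original data mixed with the rephrased data substantially influences performance.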
Abstract
Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled experiments, generating over one trillion tokens, to identify critical factors in rephrasing web text into synthetic pretraining data. Our results reveal that structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Notably, increasing the size of the generator model beyond 1B parameters provides no additional benefit. Our analysis also demonstrates that the selection of the original data used for mixing substantially influences performance. By applying our findings, we develop FinePhrase, a 486-billion-token open dataset of rephrased…
