TL;DR
SynPro is a synthetic data generation framework that improves large language model pretraining in the data-bound regime. It uses RL-optimized rephrasing and reformatting to extract more learning from limited organic data, unlocking 3.7-5.2x more effective tokens than simple repetition.
Contribution
Introduces SynPro, a reinforcement learning-based synthetic data generation method that improves data utilization in LLM pretraining, surpassing standard repetition methods.
Findings
SynPro unlocks 3.7-5.2x more effective tokens than repetition.
Models pretrained with SynPro outperform non-data-bound oracle baselines.
Faithful, model-aware synthesis maintains data-bound scaling without distribution collapse.
Abstract
LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (organic) text falls far short of scaling demands. However, reaching the data-bound regime does not mean the model has fully utilized its organic corpus. In this paper, we introduce SynPro, a synthetic data generation framework that helps LLMs more thoroughly learn from limited organic data. SynPro applies two operations, rephrasing and reformatting, that present the same organic source in diverse forms to facilitate deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus to target content the model has yet to absorb. We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline,…
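The token budgets quoted in the abstract can be checked with a short calculation. The sketch below assumes the common rule of thumb of roughly 20 training tokens per parameter for a Chinchilla-optimal budget; that ratio is an assumption here, not a figure stated in the abstract, and the names TOKENS_PER_PARAM, ORGANIC_FRACTION, and organic_token_budget are illustrative only.

```python
# Assumption: Chinchilla-optimal training uses ~20 tokens per parameter.
# This ratio is a widely used rule of thumb, not a number from the abstract.
TOKENS_PER_PARAM = 20

# From the abstract: models see 10% of their Chinchilla-optimal tokens.
ORGANIC_FRACTION = 0.10


def organic_token_budget(n_params: float) -> float:
    """Return the organic token budget for a model with n_params parameters."""
    chinchilla_optimal = n_params * TOKENS_PER_PARAM
    return chinchilla_optimal * ORGANIC_FRACTION


# Model sizes named in the abstract.
model_sizes = {"400M": 400e6, "1.1B": 1.1e9}
budgets = {name: organic_token_budget(n) for name, n in model_sizes.items()}

for name, tokens in budgets.items():
    print(f"{name}: {tokens / 1e9:.1f}B organic tokens")
```

Under that assumption the budgets come out to 0.8B and 2.2B tokens, matching the figures in the abstract.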
