Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

Zichun Yu; Chenyan Xiong

arXiv:2605.17849·cs.CL·May 19, 2026

TL;DR

SynPro is a synthetic data generation framework that helps large language model pretraining get more out of limited organic data. It rephrases and reformats the same source text with generators optimized via reinforcement learning, unlocking 3.7-5.2x more effective tokens than repetition.

Contribution

Introduces SynPro, a synthetic data generation method whose generators are optimized via reinforcement learning. It improves data utilization in LLM pretraining beyond what standard repetition of the organic data achieves.

Findings

01 SynPro unlocks 3.7-5.2x more effective tokens than repetition.

02 Models pretrained with SynPro outperform non-data-bound oracle baselines.

03 Faithful, model-aware synthesis maintains data-bound scaling without distribution collapse.

Abstract

LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (organic) text falls far short of scaling demands. However, reaching the data-bound regime does not mean the model has fully utilized its organic corpus. In this paper, we introduce SynPro, a synthetic data generation framework that helps LLMs learn more thoroughly from limited organic data. SynPro applies two operations, rephrasing and reformatting, that present the same organic source in diverse forms to facilitate deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus to target content the model has yet to absorb. We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline,…
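The organic token budgets quoted in the abstract can be sanity-checked with a short calculation. The sketch below assumes the common rule of thumb of roughly 20 training tokens per parameter for a Chinchilla-optimal budget. That ratio and the helper name `organic_token_budget` are illustrative assumptions, not taken from the paper; only the model sizes and the 10% fraction come from the abstract.

```python
# Assumption: a Chinchilla-optimal budget is approximated as ~20 tokens per
# parameter. This ratio is a widely used rule of thumb, not a value stated
# in the abstract.
TOKENS_PER_PARAM = 20

# From the abstract: models are pretrained with 10% of their
# Chinchilla-optimal tokens.
ORGANIC_FRACTION = 0.10


def organic_token_budget(n_params: float) -> float:
    """Return the organic token budget for a model with n_params parameters.

    Hypothetical helper: computes 10% of the approximate Chinchilla-optimal
    token count under the 20-tokens-per-parameter assumption.
    """
    chinchilla_optimal_tokens = n_params * TOKENS_PER_PARAM
    return chinchilla_optimal_tokens * ORGANIC_FRACTION


if __name__ == "__main__":
    for label, n_params in [("400M", 400e6), ("1.1B", 1.1e9)]:
        budget = organic_token_budget(n_params)
        print(f"{label} model: {budget / 1e9:.1f}B organic tokens")
```

Under this assumption the two budgets come out to 0.8B and 2.2B tokens, matching the figures in the abstract.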

Code & Models

Repositories

cxcscmu/SynPro (GitHub)