Data-efficient pre-training by scaling synthetic megadocs

Konwoo Kim; Suhas Kotha; Yejin Choi; Tatsunori Hashimoto; Nick Haber; Percy Liang

arXiv:2603.18534·cs.LG·March 20, 2026

Open Access

TL;DR

This paper shows that augmenting web pre-training data with synthetic rephrases, and especially combining generations from the same document into long megadocs, improves loss scaling and data efficiency, with megadocs pulling further ahead of simple rephrasing as more synthetic data is generated.

Contribution

It introduces megadoc construction, a synthetic data method that forms a single, substantially longer document from the generations derived from the same source document, and shows that it improves loss scaling and data efficiency in pre-training.

Findings

01

Synthetic data improves validation loss despite distribution differences.

02

Megadoc construction raises data efficiency from 1.48x (simple rephrasing) to 1.80x.

03

Megadocs outperform simple rephrasing, especially with more synthetic data.

Abstract

Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution. With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, plateauing near $1.48 \times$ data efficiency at 32 rephrases per document. We find even better loss scaling under a new perspective: synthetic generations from the same document can form a single substantially longer megadocument instead of many short…
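
The megadocument idea in the abstract can be illustrated with a minimal sketch. The abstract is truncated here and does not specify how megadocs are actually assembled, so everything below is an assumption made for illustration: the function name `build_megadocs`, the `(source_doc_id, text)` input format, the blank-line separator, and the choice to keep generations in their input order are all hypothetical and are not the authors' method.

```python
from collections import defaultdict


def build_megadocs(generations, separator="\n\n"):
    """Join all synthetic generations that share a source document.

    `generations` is an iterable of (source_doc_id, text) pairs.
    This input format and the separator are assumptions, not the paper's.
    """
    grouped = defaultdict(list)
    for doc_id, text in generations:
        # Keep generations in the order they arrive (an arbitrary choice).
        grouped[doc_id].append(text)
    # One long document per source document instead of many short ones.
    return {doc_id: separator.join(texts) for doc_id, texts in grouped.items()}


# Toy data: two rephrases of one document, one rephrase of another.
generations = [
    ("doc-a", "Rephrase 1 of A."),
    ("doc-b", "Rephrase 1 of B."),
    ("doc-a", "Rephrase 2 of A."),
]
megadocs = build_megadocs(generations)
```

In this toy run, the two generations for `doc-a` end up in one training document rather than two separate short ones, which is the contrast the abstract draws with simple rephrasing.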

Taxonomy

Topics: Data Quality and Management · Machine Learning and Data Classification · Web Data Mining and Analysis