BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

DatologyAI: Pratyush Maini; Vineeth Dorna; Parth Doshi; Aldo Carranza; Fan Pan; Jack Urbanek; Paul Burstein; Alex Fang; Alvin Deng; Amro Abbas; Brett Larsen; Cody Blakeney; Charvi Bannur; Christina Baek; Darren Teh; David Schwab; Haakon Mongstad; Haoli Yin; Josh Wills; Kaleigh Mentzer; Luke Merrick; Ricardo Monti; Rishabh Adiga; Siddharth Joshi; Spandan Das; Zhengping Wang; Bogdan Gaza; Ari Morcos; Matthew Leavitt

arXiv:2508.10975 · cs.LG · August 21, 2025

TL;DR

BeyondWeb is a synthetic data generation framework for large language model pretraining. Averaged across 14 benchmarks, it outperforms Cosmopedia by up to 5.1 percentage points and Nemotron-Synth by up to 2.6 percentage points, and the paper examines which factors affect synthetic data quality.

Contribution

The paper presents BeyondWeb, a synthetic data generation framework for building pretraining datasets. It outperforms state-of-the-art synthetic pretraining datasets and offers insights into the factors that affect synthetic data quality.

Findings

01. BeyondWeb outperforms Cosmopedia and Nemotron-Synth by up to 5.1pp and 2.6pp, respectively, averaged across 14 benchmark evaluations.

02. Models trained on BeyondWeb data train faster and perform better.

03. The paper offers insights into data quality factors and how they affect pretraining effectiveness.

Abstract

Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster…
