BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
DatologyAI: Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills

TL;DR
BeyondWeb introduces a high-quality synthetic data generation framework that significantly improves pretraining efficiency and performance for large language models, outperforming existing datasets and providing insights into synthetic data optimization.
Contribution
The paper presents BeyondWeb, a novel synthetic data generation framework that enhances pretraining datasets, outperforming state-of-the-art datasets and offering insights into factors affecting synthetic data quality.
Findings
BeyondWeb outperforms Cosmopedia and Nemotron-Synth by up to 5.1pp and 2.6pp, respectively, averaged across a suite of 14 benchmark evaluations.
Models trained on BeyondWeb data train faster and reach better performance.
The paper offers insights into the factors that affect synthetic data quality and their impact on pretraining effectiveness.
Abstract
Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster…
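The headline comparison in the abstract is a gap in percentage points between suite-averaged scores. As a minimal sketch of that arithmetic, assuming per-benchmark accuracies are reported in percent, the gap could be computed as below. The function name `average_gap_pp`, the benchmark names, and all numbers are hypothetical placeholders, not results or code from the paper.

```python
from statistics import mean


def average_gap_pp(scores_a, scores_b):
    """Return mean(scores_a) - mean(scores_b) in percentage points.

    Both arguments map benchmark name -> accuracy in percent and
    must cover the same set of benchmarks.
    """
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both models must be scored on the same benchmarks")
    # Average each model over the suite first, then take the difference.
    return mean(scores_a.values()) - mean(scores_b.values())


# Placeholder accuracies in percent (illustrative only, NOT from the paper).
model_x = {"bench_1": 60.0, "bench_2": 50.0, "bench_3": 40.0}
model_y = {"bench_1": 55.0, "bench_2": 48.0, "bench_3": 38.0}

gap = average_gap_pp(model_x, model_y)
print(f"{gap:.1f}pp")
```

A percentage-point gap is an absolute difference between two percentages, so it should not be read as a relative improvement.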
Taxonomy
Topics: Structural Health Monitoring Techniques
