Does your data spark joy? Performance gains from domain upsampling at the end of training
Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, and Jonathan Frankle

TL;DR
This paper demonstrates that end-of-training domain upsampling of smaller datasets significantly improves performance on difficult benchmarks for large language models, offering a cost-effective way to optimize pretraining data composition.
Contribution
It introduces a simple end-of-training upsampling technique for domain-specific data that enhances model performance and allows scalable dataset utility analysis.
Findings
Upsampling domain-specific data at the end of training improves benchmark scores by up to 8.26 pp (on GSM8K).
The best results come from upsampling during the final 10% to 20% of training.
The method rivals models trained for longer, at lower computational cost.
Abstract
Pretraining datasets for large language models (LLMs) have grown to trillions of tokens composed of large amounts of CommonCrawl (CC) web scrape along with smaller, domain-specific datasets. It is expensive to understand the impact of these domain-specific datasets on model capabilities, as training at large FLOP scales is required to reveal significant changes to difficult and emergent benchmarks. Given the increasing cost of experimenting with pretraining data, how does one determine the optimal balance between the diversity in general web scrapes and the information density of domain-specific data? In this work, we show how to leverage the smaller domain-specific datasets by upsampling them relative to CC at the end of training to drive performance improvements on difficult benchmarks. This simple technique allows us to improve up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on…
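The idea of switching the data mix near the end of training can be sketched as a simple step-dependent sampling schedule. This is only an illustration under stated assumptions, not the authors' implementation: the domain names, the mixture weights, and the helper names `mix_for_step` and `sample_domain` are all hypothetical, and the 20% default is just one point in the range the summary reports as working well.

```python
import random

# Hypothetical mixture weights (assumptions for illustration only;
# the summary does not give the actual proportions).
BASE_MIX = {"common_crawl": 0.80, "code": 0.10, "math": 0.05, "academic": 0.05}
UPSAMPLED_MIX = {"common_crawl": 0.40, "code": 0.30, "math": 0.15, "academic": 0.15}


def mix_for_step(step, total_steps, upsample_fraction=0.2):
    """Return the sampling weights to use at a given training step.

    The base mix is used until the final `upsample_fraction` of training,
    after which the domain-upsampled mix takes over.
    """
    if not 0.0 <= upsample_fraction <= 1.0:
        raise ValueError("upsample_fraction must be in [0, 1]")
    # Number of steps at the end of training that use the upsampled mix.
    upsample_steps = int(round(total_steps * upsample_fraction))
    switch_step = total_steps - upsample_steps
    return UPSAMPLED_MIX if step >= switch_step else BASE_MIX


def sample_domain(step, total_steps, rng, upsample_fraction=0.2):
    """Draw the domain for the next training example according to the schedule."""
    mix = mix_for_step(step, total_steps, upsample_fraction)
    domains = list(mix)
    weights = [mix[d] for d in domains]
    return rng.choices(domains, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    total = 1000
    for step in (0, 799, 800, 999):
        phase = "upsampled" if mix_for_step(step, total) is UPSAMPLED_MIX else "base"
        print(step, phase, sample_domain(step, total, rng))
```

With `total_steps=1000` and `upsample_fraction=0.2`, steps 0 through 799 draw from the base mix and steps 800 through 999 draw from the upsampled mix. Keeping the schedule as a pure function of the step makes it easy to vary the upsampling duration when comparing runs.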
Taxonomy
Topics: Topic Modeling
Methods: Balanced Selection
