Does your data spark joy? Performance gains from domain upsampling at the end of training

Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, and Jonathan Frankle

arXiv:2406.03476·cs.LG·June 6, 2024

TL;DR

This paper demonstrates that end-of-training domain upsampling of smaller datasets significantly improves performance on difficult benchmarks for large language models, offering a cost-effective way to optimize pretraining data composition.

Contribution

It introduces a simple end-of-training upsampling technique for domain-specific data that enhances model performance and allows scalable dataset utility analysis.

Findings

01. Upsampling domain-specific data at the end of training improves benchmark scores by up to 8.26 pp (on GSM8K).

02. The optimal duration of domain upsampling is the final 10% to 20% of training.

03. The method rivals models trained for longer, at lower computational cost.

Abstract

Pretraining datasets for large language models (LLMs) have grown to trillions of tokens composed of large amounts of CommonCrawl (CC) web scrape along with smaller, domain-specific datasets. It is expensive to understand the impact of these domain-specific datasets on model capabilities, as training at large FLOP scales is required to reveal significant changes to difficult and emergent benchmarks. Given the increasing cost of experimenting with pretraining data, how does one determine the optimal balance between the diversity in general web scrapes and the information density of domain-specific data? In this work, we show how to leverage the smaller domain-specific datasets by upsampling them relative to CC at the end of training to drive performance improvements on difficult benchmarks. This simple technique allows us to improve up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on…
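The core idea in the abstract, switching from a CC-heavy data mix to a domain-upsampled mix for the final stretch of training, can be sketched as a step-dependent sampling schedule. This is a minimal illustration, not the authors' implementation: the domain names, the mix proportions, and the helper names `mix_at_step` and `sample_domain` are all hypothetical, and the 20% default is simply one value from the 10% to 20% range reported on this page.

```python
import random

# Hypothetical mixes: the domains and proportions are illustrative assumptions,
# not values taken from the paper.
BASE_MIX = {"common_crawl": 0.80, "code": 0.10, "math": 0.05, "reference": 0.05}
UPSAMPLED_MIX = {"common_crawl": 0.40, "code": 0.25, "math": 0.20, "reference": 0.15}


def mix_at_step(step, total_steps, upsample_fraction=0.2):
    """Return the domain sampling weights to use at a given training step.

    The base mix is used until the final `upsample_fraction` of training,
    after which the domain-upsampled mix takes over.
    """
    if not 0.0 <= upsample_fraction <= 1.0:
        raise ValueError("upsample_fraction must be in [0, 1]")
    switch_step = int(round(total_steps * (1.0 - upsample_fraction)))
    return UPSAMPLED_MIX if step >= switch_step else BASE_MIX


def sample_domain(step, total_steps, rng, upsample_fraction=0.2):
    """Draw the domain for the next batch according to the current mix."""
    mix = mix_at_step(step, total_steps, upsample_fraction)
    domains = list(mix)
    weights = [mix[d] for d in domains]
    return rng.choices(domains, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    total = 1000
    for step in (0, 799, 800, 999):
        mix = mix_at_step(step, total)
        print(step, "CC weight:", mix["common_crawl"], "sampled:", sample_domain(step, total, rng))
```

In a real pipeline the same switch would typically be expressed as a change in dataset sampling weights in the data loader configuration; the single hard switch here is the simplest schedule that matches the description.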

Datasets

stefan-it/nanochat-german-eval-data
dataset · 18 dl

Taxonomy

Topics: Topic Modeling

Methods: Balanced Selection