Language Models Improve When Pretraining Data Matches Target Tasks

David Mizrahi; Anders Boesen Lindbo Larsen; Jesse Allardice; Suzie Petryk; Yuri Gorokhov; Jeffrey Li; Alex Fang; Josh Gardner; Tom Gunter; Afshin Dehghan

arXiv:2507.12466 · cs.CL · July 17, 2025

TL;DR

This paper introduces BETR (benchmark-targeted ranking), a data selection method that explicitly matches pretraining data to target benchmarks. It reports a 2.1x compute efficiency gain over the baseline and improved performance on 9 out of 10 tasks.

Contribution

We propose BETR, a benchmark-targeted ranking method that aligns pretraining data with evaluation benchmarks, enhancing model performance and generalization.

Findings

01. BETR achieves 2.1x compute efficiency over baseline.

02. BETR improves performance on 9 out of 10 tasks.

03. Larger models require less aggressive data filtering.

Abstract

Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine accordingly. This raises a natural question: what happens when we make this optimization explicit? To explore this, we propose benchmark-targeted ranking (BETR), a simple method that selects pretraining documents based on similarity to benchmark training examples. BETR embeds benchmark examples and a sample of pretraining documents in a shared space, scores this sample by similarity to benchmarks, then trains a lightweight classifier to predict these scores for the full corpus. We compare data selection methods by training over 500 models spanning $10^{19}$ to $10^{22}$ FLOPs and fitting scaling laws to them. From this, we find that simply aligning…
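
The three steps named in the abstract (embed, score a sample, predict scores for the full corpus) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation. All function names are hypothetical. `embed` is a toy hashed bag-of-words stand-in for a real embedding model. The max-cosine scoring rule is an assumption, since the abstract says only that the sample is scored "by similarity to benchmarks". `fit_scorer` uses per-token score averaging as a stand-in for the lightweight classifier, whose actual form is not given here.

```python
import math
import zlib
from collections import defaultdict

DIM = 512  # hypothetical embedding width for the toy embedder


def embed(text):
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Dot product of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def similarity_scores(sample_docs, benchmark_examples):
    """Assumed scoring rule: max cosine similarity to any benchmark example."""
    bench = [embed(b) for b in benchmark_examples]
    return [max(cosine(embed(d), b) for b in bench) for d in sample_docs]


def fit_scorer(sample_docs, scores):
    """Stand-in for the lightweight classifier: mean score per token."""
    totals, counts = defaultdict(float), defaultdict(int)
    for doc, score in zip(sample_docs, scores):
        for tok in set(doc.lower().split()):
            totals[tok] += score
            counts[tok] += 1
    token_score = {t: totals[t] / counts[t] for t in totals}
    default = sum(scores) / len(scores)  # used for unseen tokens
    return token_score, default


def predict(doc, model):
    """Predicted score for a document: average of its tokens' learned scores."""
    token_score, default = model
    toks = doc.lower().split()
    if not toks:
        return default
    return sum(token_score.get(t, default) for t in toks) / len(toks)


def select_top(corpus, model, keep_fraction):
    """Keep the top-scoring fraction of the corpus, preserving corpus order."""
    k = max(1, int(keep_fraction * len(corpus)))
    ranked = sorted(range(len(corpus)),
                    key=lambda i: predict(corpus[i], model), reverse=True)
    return [corpus[i] for i in sorted(ranked[:k])]
```

In this sketch, only the small sample is compared against the benchmark examples directly. The fitted scorer then ranks every document in the corpus cheaply, and `keep_fraction` controls how aggressively the corpus is filtered.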

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques