Reusing Pre-Training Data at Test Time is a Compute Multiplier

Alex Fang; Thomas Voice; Ruoming Pang; Ludwig Schmidt; Tom Gunter

arXiv:2511.04234 · cs.CL · November 7, 2025

Open Access · 3 Reviews

TL;DR

This paper shows that retrieving from the pre-training corpus at test time significantly improves language model accuracy on MMLU, Math-500, and SimpleQA, acting as a compute multiplier, and that current pre-training methods underuse the information in their datasets.

Contribution

The study quantifies the dataset value left unused by pre-training and demonstrates how retrieval at test time can substantially improve model performance across multiple benchmarks.

Findings

01

On MMLU, retrieval at test time acts as a ~5x compute multiplier versus pre-training alone.

02

Spending additional test-time compute on parsing the retrieved context yields a 10 percentage point improvement on MMLU.

03

Pre-training methods leave significant dataset information untapped, indicating room for progress.

Abstract

Large language models learn from their vast pre-training corpora, gaining the ability to solve an ever increasing variety of tasks; yet although researchers work to improve these datasets, there is little effort to understand how efficient the pre-training apparatus is at extracting ideas and knowledge from the data. In this work, we use retrieval augmented generation along with test-time compute as a way to quantify how much dataset value was left behind by the process of pre-training, and how this changes across scale. We demonstrate that pre-training then retrieving from standard and largely open-sourced datasets results in significant accuracy gains in MMLU, Math-500, and SimpleQA, which persist through decontamination. For MMLU we observe that retrieval acts as a ~5x compute multiplier versus pre-training alone. We show that these results can be further improved by leveraging…
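One way to read the "~5x compute multiplier" is as a ratio of effective to actual pre-training compute: model baseline accuracy as a function of compute, then ask how much baseline compute would be needed to match the accuracy reached with retrieval. The sketch below assumes a sigmoid in log10(compute), a form one of the reviews attributes to the paper. The function names, parameter values, and accuracies are hypothetical illustrations, not numbers from the paper.

```python
import math


def sigmoid_accuracy(compute, floor, ceiling, midpoint, slope):
    """Baseline accuracy as a sigmoid in log10(compute).

    All four shape parameters are hypothetical, chosen only for illustration.
    """
    z = slope * (math.log10(compute) - midpoint)
    return floor + (ceiling - floor) / (1.0 + math.exp(-z))


def compute_for_accuracy(accuracy, floor, ceiling, midpoint, slope):
    """Invert the sigmoid: baseline compute needed to reach `accuracy`."""
    if not floor < accuracy < ceiling:
        raise ValueError("accuracy must lie strictly between floor and ceiling")
    frac = (accuracy - floor) / (ceiling - floor)
    log_compute = midpoint + math.log(frac / (1.0 - frac)) / slope
    return 10.0 ** log_compute


def compute_multiplier(actual_compute, accuracy_with_retrieval, **params):
    """Effective baseline compute divided by the compute actually spent."""
    return compute_for_accuracy(accuracy_with_retrieval, **params) / actual_compute


# Hypothetical curve and a hypothetical 5-point accuracy gain from retrieval.
PARAMS = dict(floor=0.25, ceiling=0.90, midpoint=22.0, slope=1.5)
actual = 1e21
base_acc = sigmoid_accuracy(actual, **PARAMS)
retrieval_acc = base_acc + 0.05
multiplier = compute_multiplier(actual, retrieval_acc, **PARAMS)
print(f"baseline accuracy {base_acc:.3f}, multiplier {multiplier:.2f}x")
```

Because the multiplier comes from inverting a fitted curve, it is sensitive to the curve's shape near the ceiling, which is why the reviewers ask for more scrutiny of the fit.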

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The observation, if it holds at a larger scale, would be interesting and could offer valuable guidance to the community.

Weaknesses

1. The scale of the experiment is limited. It seems less surprising if smaller models could not fully utilize the pretraining data.
2. If the authors want to propose this as a method to be used rather than just an experiment for some observation, then they should compare it with other test time scaling methods.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The design (pre-train on a corpus, then retrieve from exactly that corpus at test time) probes how much of the corpus’ information is not captured parametrically. This bridges two active lines of work (RAG and inference-time compute) in a controlled, data-centric way.
- Paper does not hand-wave contamination: it (a) quantifies overlap, (b) shows that decontaminated retrieval still helps (substantially for MMLU), and (c) surfaces the surprisingly large portion of benchmark content present in co
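The design described in the first point above (retrieve at test time from the same corpus used for pre-training, then condition the model on what was retrieved) can be illustrated with a toy retriever. This is only a sketch under loud assumptions: the bag-of-words cosine scoring, the prompt template, the made-up corpus, and the names `retrieve` and `build_prompt` are all hypothetical, and the paper's actual retriever and prompt format are not described in this summary.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with counts (a stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=2):
    """Return the k corpus documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, bag_of_words(doc)), reverse=True)
    return ranked[:k]


def build_prompt(question, corpus, k=2):
    """Prepend retrieved documents to the question (hypothetical template)."""
    docs = retrieve(question, corpus, k)
    context = "\n".join(f"[doc {i + 1}] {d}" for i, d in enumerate(docs))
    return f"{context}\n\nQuestion: {question}\nAnswer:"


# Made-up stand-in for the pre-training corpus.
CORPUS = [
    "The mitochondrion is the organelle that produces most of a cell's ATP.",
    "Paris is the capital of France.",
    "Quicksort has average time complexity O(n log n).",
]
prompt = build_prompt("Which organelle produces ATP?", CORPUS, k=1)
print(prompt)
```

The point of the controlled design is that the retrieval store adds no new data: any gain from the prompt above would come from information the model already saw during pre-training but did not retain parametrically.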

Weaknesses

- MMLU has documented label and quality issues; using it as the sole basis to translate accuracy into compute multipliers risks reporting an artifact of the fit or of dataset flaws.
- Decontamination by token n-gram overlap (16-gram for MMLU, 26-gram for Math-500) is a good start but cannot remove paraphrastic or templatic leakage. Because the retrieval store is identical to pre-training corpora, any residual overlap inflates the measured gap between “base” and “+retrieval.”
- The “compute multi
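The n-gram decontamination mentioned in the second point above can be sketched as follows: drop any retrieval document that shares at least one token n-gram with a benchmark item. The whitespace tokenizer, the function names, the example strings, and the small `n` used in the demo are assumptions for illustration. The review cites 16-grams for MMLU and 26-grams for Math-500; the paper's tokenizer and exact matching rule are not given here.

```python
def token_ngrams(text, n):
    """Set of lowercased whitespace-token n-grams (tokenizer is an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(documents, benchmark_items, n):
    """Keep only documents that share no token n-gram with any benchmark item."""
    banned = set()
    for item in benchmark_items:
        banned |= token_ngrams(item, n)
    return [doc for doc in documents if token_ngrams(doc, n).isdisjoint(banned)]


# Made-up data; n=4 keeps the demo short.
benchmark = ["what is the capital of france"]
docs = [
    "quiz: what is the capital of france answer paris",  # verbatim overlap
    "france has its seat of government in paris",        # paraphrase
    "the french capital city is paris",                  # paraphrase
]
clean = decontaminate(docs, benchmark, n=4)
print(clean)
```

In this demo only the verbatim copy is removed. Both paraphrases survive the filter even though they reveal the answer, which is the residual leakage the reviewer is pointing at.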

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The paper argues an interesting and highly valuable point: well-done retrieval can benefit performance much more than additional generic model training. The idea of presenting gains in terms of "compute multipliers" is convincing (although the underlying simple sigmoid model deserves more attention). The paper also provides evidence of additive benefits with other test-time procedures, which is promising. There is a massive number of experiments here, supporting these findings and

Weaknesses

The main weakness of the paper is a general lack of clarity about exactly what is done. One blatant example is the Experimental Setup section, especially 3.2: the short paragraph provides a straightforward, high-level description of the retrieval process, but no description of how the retrieved documents are used. Presumably this is a RAG-style process, but absolutely no detail is provided. Given that this is mainly an experimental paper, it is absolutely necessary that experimental details are provided, at mini

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification