Quantifying the Effect of Test Set Contamination on Generative Evaluations

Rylan Schaeffer; Joshua Kazdan; Baber Abbasi; Ken Ziyu Liu; Brando Miranda; Ahmed Ahmed; Fazl Barez; Abhay Puri; Stella Biderman; Niloofar Mireshghallah; Sanmi Koyejo

arXiv:2601.04301 · cs.LG · February 9, 2026

TL;DR

This paper quantifies how test set contamination affects generative evaluations of language models. Even a single test set replica in the pretraining corpus can skew performance metrics, and inference-time choices such as sampling temperature change how strongly memorization shows up.

Contribution

The paper provides a quantitative analysis of how test set contamination affects generative evaluations across the language model lifecycle, covering model scaling, overtraining, and inference-time factors.

Findings

01

Performance improves with both model size and the number of test set replicas in the pretraining corpus.

02

A single test set replica in the pretraining corpus can push a model's loss below the irreducible error.

03

High sampling temperatures and longer solutions reduce memorization effects.
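
As a rough illustration of the temperature finding, the sketch below shows how temperature scaling lowers the probability assigned to a single dominant next token. It is a generic softmax example, not the paper's evaluation code; the logits are hypothetical, and index 0 is assumed to stand for a memorized token.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return token probabilities after dividing the logits by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: index 0 plays the role of a memorized next token.
logits = [5.0, 2.0, 1.0, 0.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: p(memorized token) = {probs[0]:.3f}")
```

In this toy setting the probability of the dominant token falls as the temperature rises, so sampling a long memorized continuation token by token becomes less likely.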

Abstract

As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower…
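
The contamination setup described in the abstract can be pictured with a minimal sketch that mixes a chosen number of test set replicas into a web corpus. The function names, the whitespace token count, and the toy documents are all assumptions made for illustration; the paper's actual data pipeline is not described on this page.

```python
import random


def build_contaminated_corpus(web_docs, test_docs, num_replicas, seed=0):
    """Mix `num_replicas` copies of the test set into the web documents."""
    corpus = list(web_docs) + list(test_docs) * num_replicas
    random.Random(seed).shuffle(corpus)  # seeded shuffle for reproducibility
    return corpus


def contamination_fraction(web_docs, test_docs, num_replicas):
    """Fraction of tokens that come from test set replicas.

    Whitespace splitting is a stand-in for a real tokenizer (assumption).
    """
    web_tokens = sum(len(d.split()) for d in web_docs)
    test_tokens = num_replicas * sum(len(d.split()) for d in test_docs)
    return test_tokens / (web_tokens + test_tokens)


# Toy stand-ins for web data and benchmark problems (hypothetical).
web_docs = ["some web text about cooking"] * 1000
test_docs = ["Problem: compute 2 + 2. Solution: 4"] * 10

for replicas in (0, 1, 10):
    corpus = build_contaminated_corpus(web_docs, test_docs, replicas)
    frac = contamination_fraction(web_docs, test_docs, replicas)
    print(f"replicas={replicas}: {len(corpus)} docs, {frac:.4%} test tokens")
```

Sweeping `num_replicas` alongside model size mirrors the kind of grid the abstract describes, though the real experiments would fix token budgets and tokenization in ways this sketch does not.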

Taxonomy

Topics: Topic Modeling · Software Testing and Debugging Techniques · Software Engineering Research