How Much Can We Forget about Data Contamination?

Sebastian Bordt; Suraj Srinivas; Valentyn Boreiko; Ulrike von Luxburg

arXiv:2410.03249 · cs.LG · June 17, 2025

TL;DR

This paper investigates the impact of benchmark data contamination in large language model training, showing that models trained well beyond the Chinchilla-optimal amount of data can forget contaminated examples and that small-scale contamination does not necessarily invalidate benchmark evaluations.

Contribution

It provides a quantitative analysis of data contamination effects and demonstrates that large-scale training can mitigate overfitting from benchmark data leakage.

Findings

01

Minor contamination leads to overfitting when model and data follow the Chinchilla scaling laws.

02

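Even 144 repetitions of a contaminated example can be forgotten when the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs.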

03

Empirical forgetting occurs faster than cumulative weight decay alone would explain.
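To make the weight-decay finding more concrete, here is a rough sketch of how much decoupled weight decay on its own would shrink an earlier weight update. It assumes the AdamW-style rule in which each step multiplies the weights by (1 - lr * weight_decay), and it ignores the gradient updates entirely. The cosine schedule and all hyperparameter values are hypothetical and are not taken from the paper.

```python
import math


def cosine_schedule(peak_lr, total_steps, min_lr=0.0):
    """Hypothetical cosine learning-rate schedule (no warmup), one value per step."""
    return [
        min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t / total_steps))
        for t in range(total_steps)
    ]


def cumulative_decay_factor(lrs, weight_decay):
    """Product of (1 - lr_t * weight_decay) over the given steps.

    Under decoupled weight decay this is the factor by which a weight
    component present before these steps is scaled afterwards, if the
    gradient updates are ignored.
    """
    factor = 1.0
    for lr in lrs:
        factor *= 1.0 - lr * weight_decay
    return factor


if __name__ == "__main__":
    # Illustrative values only (assumptions, not the paper's settings).
    lrs = cosine_schedule(peak_lr=3e-4, total_steps=100_000)
    seen_at = 20_000  # step at which a contaminated example was last seen
    remaining = cumulative_decay_factor(lrs[seen_at:], weight_decay=0.1)
    print(f"fraction of the update left after weight decay alone: {remaining:.3f}")
```

A factor close to 1 means weight decay by itself removes little of the earlier update, so any faster forgetting observed in practice would have to come from other parts of the training dynamics.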

Abstract

The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: The number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 times of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. Next, we study the impact…
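The phrase "five times Chinchilla" can be turned into token counts with a small calculation. The sketch below assumes the commonly cited rule of thumb of roughly 20 training tokens per parameter as the Chinchilla-optimal budget; the paper's exact definition may differ. The function names are hypothetical, and pairing the abstract's two upper bounds (1.6B parameters, 40B tokens) is for illustration only, since the abstract does not say they were used together.

```python
# Assumption: ~20 tokens per parameter approximates the Chinchilla-optimal budget.
TOKENS_PER_PARAM = 20.0


def chinchilla_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Approximate Chinchilla-optimal number of training tokens."""
    return n_params * tokens_per_param


def chinchilla_multiple(n_params, n_tokens, tokens_per_param=TOKENS_PER_PARAM):
    """How many times the Chinchilla-optimal budget a training run uses."""
    return n_tokens / chinchilla_tokens(n_params, tokens_per_param)


def tokens_for_multiple(n_params, multiple, tokens_per_param=TOKENS_PER_PARAM):
    """Tokens needed to train at a given multiple of the Chinchilla budget."""
    return multiple * chinchilla_tokens(n_params, tokens_per_param)


if __name__ == "__main__":
    n_params = 1.6e9  # largest model size mentioned in the abstract
    n_tokens = 40e9   # largest token count mentioned in the abstract
    print(f"Chinchilla budget: {chinchilla_tokens(n_params) / 1e9:.0f}B tokens")
    print(f"Multiple at 40B tokens: {chinchilla_multiple(n_params, n_tokens):.2f}x")
    print(f"Tokens for 5x: {tokens_for_multiple(n_params, 5.0) / 1e9:.0f}B")
```

Under this rule of thumb, smaller models reach high Chinchilla multiples with far fewer tokens, which is one way to probe the forgetting regime on a limited budget.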

Code & Models

Repositories

tml-tuebingen/forgetting-contamination
PyTorch · Official

Taxonomy

Topics: Privacy-Preserving Technologies in Data

Methods: Weight Decay · LLaMA · Chinchilla · AdamW