How Much Can We Forget about Data Contamination?
Sebastian Bordt, Suraj Srinivas, Valentyn Boreiko, Ulrike von Luxburg

TL;DR
This paper investigates the impact of data contamination in large language model training, showing that modern models can forget contaminated data over time and that overfitting due to small-scale contamination is less severe than previously thought.
Contribution
It provides a quantitative analysis of data contamination effects and demonstrates that large-scale training can mitigate overfitting from benchmark data leakage.
Findings
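When model and data follow the Chinchilla scaling laws, even minor contamination leads to overfitting.
Even 144 repetitions of a contaminated example can be forgotten once the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs.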
Weight decay influences forgetting, which occurs faster than expected.
Abstract
The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: The number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 times of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. Next, we study the impact…
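To make the "times Chinchilla" terminology concrete, here is a minimal sketch of how a training run's token budget can be expressed as a multiple of the Chinchilla-optimal budget. It assumes the commonly cited rule of thumb of roughly 20 training tokens per parameter; that constant, the function names, and the pairing of 1.6B parameters with 40B tokens (the upper ends of the ranges quoted above, not necessarily a single run from the paper) are illustrative assumptions rather than details taken from the paper.

```python
# Assumption: Chinchilla-optimal training uses about 20 tokens per parameter.
# This is a widely used heuristic, not a value stated in the abstract.
TOKENS_PER_PARAM = 20.0


def chinchilla_multiple(n_params: float, n_tokens: float,
                        tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Return the token budget as a multiple of the Chinchilla-optimal budget."""
    if n_params <= 0 or n_tokens < 0:
        raise ValueError("n_params must be positive and n_tokens non-negative")
    return n_tokens / (tokens_per_param * n_params)


def tokens_for_multiple(n_params: float, multiple: float,
                        tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Return the number of tokens needed to reach a given Chinchilla multiple."""
    return multiple * tokens_per_param * n_params


# Hypothetical pairing: a 1.6B-parameter model trained on 40B tokens.
ratio = chinchilla_multiple(1.6e9, 40e9)
# Tokens the same model would need to reach five times Chinchilla.
needed = tokens_for_multiple(1.6e9, 5.0)
print(f"{ratio:.2f}x Chinchilla; 5x would need {needed / 1e9:.0f}B tokens")
```

Under this heuristic, a run only enters the "beyond five times Chinchilla" regime when its token count exceeds about 100 tokens per parameter.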
Taxonomy
Topics: Privacy-Preserving Technologies in Data
Methods: Weight Decay · LLaMA · Chinchilla · AdamW
