A Survey on Data Contamination for Large Language Models

Yuxing Cheng; Yi Chang; Yuan Wu

arXiv:2502.14425·cs.CL·June 6, 2025·5 cites

Open Access·1 Repo

TL;DR

This survey reviews data contamination in the evaluation of Large Language Models, covering its impacts and detection methods and proposing future directions for more reliable assessment protocols.

Contribution

It provides a comprehensive overview of contamination issues, categorizes detection methods, and highlights strategies for contamination-free evaluation of LLMs.

Findings

01

Data contamination can inflate LLM performance metrics.

02

Dynamic benchmarks and LLM-driven evaluation methods are promising.

03

Detection approaches include white-box, gray-box, and black-box methods.

Abstract

Recent advancements in Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis. However, the reliability of performance evaluation has come under scrutiny due to data contamination, the unintended overlap between training and test datasets. This overlap has the potential to artificially inflate model performance, as LLMs are typically trained on extensive datasets scraped from publicly available sources. These datasets often inadvertently overlap with the benchmarks used for evaluation, leading to an overestimation of the models' true generalization capabilities. In this paper, we first examine the definition and impacts of data contamination. Second, we review methods for contamination-free evaluation, focusing on three strategies: data updating-based methods, data rewriting-based methods, and prevention-based…
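To make the notion of overlap between training and test data concrete, here is a minimal sketch of a word-level n-gram overlap check. This is a generic illustration and is not taken from the survey or from the linked repository. The function names `ngrams` and `overlap_ratio`, the whitespace tokenization, and the default `n=3` are all assumptions chosen for brevity.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_text, train_text, n=3):
    """Fraction of the test sample's n-grams that also occur in the training text.

    Hypothetical helper: a high ratio hints that the test sample may have
    leaked into the training data; it does not prove contamination.
    """
    test_grams = ngrams(test_text, n)
    if not test_grams:
        # The test sample is shorter than n tokens, so nothing can be compared.
        return 0.0
    train_grams = ngrams(train_text, n)
    return len(test_grams & train_grams) / len(test_grams)


if __name__ == "__main__":
    benchmark_item = "the cat sat on the mat"
    scraped_page = "yesterday the cat sat on the mat quietly"
    print(overlap_ratio(benchmark_item, scraped_page))
```

A check like this only catches verbatim or near-verbatim reuse. Paraphrased or translated benchmark items would pass unnoticed, which is one reason a single surface-level score is a weak signal on its own.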

Code & Models

Repositories

liyucheng09/contamination_detector
none·Official

Taxonomy

Topics

Privacy-Preserving Technologies in Data