A Survey on Data Contamination for Large Language Models
Yuxing Cheng, Yi Chang, Yuan Wu

TL;DR
This survey reviews data contamination in the evaluation of Large Language Models: it discusses the impacts of contamination, surveys detection methods, and proposes future directions for more reliable assessment protocols.
Contribution
It provides a comprehensive overview of contamination issues, categorizes detection methods, and highlights strategies for contamination-free evaluation of LLMs.
Findings
Data contamination can inflate LLM performance metrics.
Dynamic benchmarks and LLM-driven evaluation methods are promising directions for more reliable evaluation.
Detection approaches include white-box, gray-box, and black-box methods.
Abstract
Recent advancements in Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis. However, the reliability of performance evaluation has come under scrutiny due to data contamination, the unintended overlap between training and test datasets. This overlap has the potential to artificially inflate model performance, as LLMs are typically trained on extensive datasets scraped from publicly available sources. These datasets often inadvertently overlap with the benchmarks used for evaluation, leading to an overestimation of the models' true generalization capabilities. In this paper, we first examine the definition and impacts of data contamination. Second, we review methods for contamination-free evaluation, focusing on three strategies: data updating-based methods, data rewriting-based methods, and prevention-based…
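To make the notion of "overlap between training and test datasets" concrete, the following is a minimal sketch of a word-level n-gram overlap check. It is an illustration only, not a method taken from the survey: the function names `word_ngrams` and `overlap_ratio`, the whitespace tokenization, and the default n-gram length of 8 are all assumptions, and the check presumes access to the training corpus, which is often unavailable in practice.

```python
from typing import List, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_item: str, corpus_docs: List[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus.

    Hypothetical helper: n=8 is an illustrative default, not a recommended value.
    """
    item_grams = word_ngrams(test_item, n)
    if not item_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= word_ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog"]
    print(overlap_ratio("quick brown fox jumps", corpus, n=3))
    print(overlap_ratio("a completely different sentence here", corpus, n=3))
```

A ratio close to 1 suggests that a benchmark item appears nearly verbatim in the corpus, while a ratio close to 0 suggests no verbatim overlap. Exact matching of this kind misses paraphrased or translated copies of a test item, so a low score does not rule out contamination.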
Taxonomy
Topics: Privacy-Preserving Technologies in Data
