Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu, Ozlem Uzuner, Meliha Yetisgen, Fei Xia

TL;DR
This paper reviews and evaluates data contamination detection methods in large language models, revealing that many assumptions underlying these methods are unvalidated and that current detection approaches may often be ineffective or unreliable.
Contribution
It systematically categorizes assumptions in data contamination detection, tests key assumptions through case studies, and highlights limitations of current methods in real-world scenarios.
Findings
Membership inference attack (MIA) approaches can perform no better than random guessing on certain datasets.
Current assumptions in detection methods are often unvalidated or invalid.
Detection methods fail under data distribution shifts.
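To make "no better than random guessing" concrete, here is a minimal sketch of how an MIA-style detector is commonly scored. It is not the paper's code: it assumes each example already has a membership score (for instance a negative per-example loss, where higher means "more likely seen in training") and computes the AUC, which sits near 0.5 when the scores carry no membership signal. The function name `auc` and all score values are hypothetical and chosen only for illustration.

```python
def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half a win. A value near 0.5 means the detector is
    no better than random guessing; 1.0 means perfect separation.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical scores (e.g. negative loss); not taken from the paper.
separable = auc([3.0, 4.0], [1.0, 2.0])      # members always score higher
uninformative = auc([1.0, 2.0], [1.0, 2.0])  # identical score distributions
print(separable, uninformative)
```

Under these assumptions, a detector whose member and non-member scores overlap completely lands at an AUC of 0.5, which is the sense in which a method can be called no better than chance.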
Abstract
Large language models (LLMs) have demonstrated great performance across various benchmarks, showing potential as general-purpose task solvers. However, as LLMs are typically trained on vast amounts of data, a significant concern in their evaluation is data contamination, where overlap between training data and evaluation datasets inflates performance assessments. Multiple approaches have been developed to identify data contamination. These approaches rely on specific assumptions that may not hold universally across different settings. To bridge this gap, we systematically review 50 papers on data contamination detection, categorize the underlying assumptions, and assess whether they have been rigorously validated. We identify and analyze eight categories of assumptions and test three of them as case studies. Our case studies focus on detecting direct, instance-level data contamination,…
Taxonomy
Topics: Advanced Data Storage Technologies · Cloud Data Security Solutions · Big Data and Digital Economy
