CAP: Data Contamination Detection via Consistency Amplification

Yi Zhao; Jing Li; Linyi Yang

arXiv:2410.15005·cs.CL·October 22, 2024

TL;DR

The paper introduces CAP, a novel framework for detecting data contamination in large language models by measuring dataset leakage through consistency amplification, applicable across various models and benchmarks.

Contribution

To the authors' knowledge, CAP is the first method to explicitly differentiate between fine-tuning and contamination, a distinction that is crucial for detecting contamination in domain-specific models.

Findings

01. CAP effectively detects contamination across seven LLMs.

02. Composite benchmarks are highly prone to unintentional contamination.

03. CAP works for both white-box and black-box models.

Abstract

Large language models (LLMs) are widely used, but concerns about data contamination challenge the reliability of LLM evaluations. Existing contamination detection methods are often task-specific or require extra prerequisites, limiting practicality. We propose a novel framework, Consistency Amplification-based Data Contamination Detection (CAP), which introduces the Performance Consistency Ratio (PCR) to measure dataset leakage by leveraging LM consistency. To the best of our knowledge, this is the first method to explicitly differentiate between fine-tuning and contamination, which is crucial for detecting contamination in domain-specific models. Additionally, CAP is applicable to various benchmarks and works for both white-box and black-box models. We validate CAP's effectiveness through experiments on seven LLMs and four domain-specific benchmarks. Our findings also show that…

Taxonomy

Topics: Digital and Cyber Forensics · Advanced Data Storage Technologies · Security and Verification in Computing