Data Contamination Can Cross Language Barriers

Feng Yao; Yufan Zhuang; Zihao Sun; Sunan Xu; Animesh Kumar; Jingbo Shang

arXiv:2406.13236·cs.CL·October 31, 2024

TL;DR

This paper reveals a new form of cross-lingual contamination in large language models that evades current detection methods, proposes generalization-based approaches to detect it, and discusses its implications for model interpretation and multilingual capabilities.

Contribution

The paper introduces a cross-lingual form of contamination, develops generalization-based methods to detect it, and explores how those methods can be applied to understanding and enhancing LLMs.

Findings

01. Cross-lingual contamination can fool existing detection methods.

02. Modified benchmarks reveal models' inability to generalize when contaminated.

03. Proposed methods effectively detect deeply concealed contamination.
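
Finding 01 can be illustrated with a toy overlap check. The sketch below is not the paper's code or any specific published detector; it is a minimal word-level n-gram overlap score, assuming whitespace tokenization and trigrams, with hypothetical names (`ngrams`, `overlap_ratio`) and made-up example strings. It shows why a detector that looks only for surface overlap scores a translated copy of a test item the same as unrelated text.

```python
def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, training_doc, n=3):
    """Fraction of the benchmark item's n-grams that also occur in the training document."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(training_doc, n)) / len(bench)


# Hypothetical strings, for illustration only.
question = "What is the capital of France ?"
verbatim_leak = "quiz dump : What is the capital of France ? Paris"
translated_leak = "Quelle est la capitale de la France ? Paris"

print(overlap_ratio(question, verbatim_leak))    # every trigram of the question is found
print(overlap_ratio(question, translated_leak))  # no trigram survives translation
```

The translated string carries the same question and answer, yet shares no trigram with the English benchmark item, so an overlap-only score cannot tell it apart from clean data.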

Abstract

The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on the text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods, deliberately injected by overfitting LLMs on the translated versions of benchmark test sets. Then, we propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine the LLM's performance change after modifying the original benchmark by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize…
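
The benchmark modification described in the abstract (replacing the false answer choices with correct ones from other questions) could be sketched as follows. This is a hedged illustration, not the authors' implementation: the item layout (`question`, `choices`, `answer` index), the function name `generalize`, the random sampling of replacement choices, and the reshuffling of choice positions are all assumptions made here for the example.

```python
import random


def generalize(items, seed=0):
    """Build a modified benchmark: each item keeps its own correct answer,
    and every false choice is replaced by the correct answer of another question."""
    rng = random.Random(seed)
    modified = []
    for i, item in enumerate(items):
        correct = item["choices"][item["answer"]]
        # Correct answers of all other questions, de-duplicated, excluding this item's answer.
        pool = [o["choices"][o["answer"]] for j, o in enumerate(items) if j != i]
        pool = [p for p in dict.fromkeys(pool) if p != correct]
        n_false = len(item["choices"]) - 1
        if len(pool) < n_false:
            raise ValueError("not enough other questions to replace every false choice")
        choices = rng.sample(pool, n_false) + [correct]
        rng.shuffle(choices)  # assumption: choice positions are re-randomized
        modified.append({
            "question": item["question"],
            "choices": choices,
            "answer": choices.index(correct),
        })
    return modified


# Hypothetical toy benchmark, for illustration only.
toy_items = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5"], "answer": 1},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Berlin"], "answer": 0},
    {"question": "Color of a clear daytime sky?", "choices": ["green", "red", "blue"], "answer": 2},
]
modified = generalize(toy_items)
for item in modified:
    print(item["question"], item["choices"], "->", item["choices"][item["answer"]])
```

Under this setup, a model would be scored on both `toy_items` and `modified`, and the change in accuracy between the two is the quantity the abstract says is examined.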

Code & Models

Repositories

shangdatalab/deep-contam
PyTorch · Official

Videos

Data Contamination Can Cross Language Barriers · Underline

Taxonomy

Topics: Privacy-Preserving Technologies in Data