Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models

Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li

arXiv:2402.15938 · cs.CL · June 3, 2024 · 3 cites

TL;DR

This paper introduces methods for detecting and mitigating data contamination in large language models using output distribution analysis, improving evaluation trustworthiness and revealing contamination issues in models like ChatGPT.

Contribution

The paper proposes CDD for contamination detection and TED for trustworthy evaluation, along with two new benchmarks, addressing challenges of data contamination in LLMs.

Findings

01 CDD improves detection accuracy by up to 30%

02 TED reduces contamination-related performance gains by 66.9%

03 ChatGPT shows significant contamination susceptibility on HumanEval

Abstract

Recent statements about the impressive capabilities of large language models (LLMs) are usually supported by evaluating on open-access benchmarks. Considering the vast size and wide-ranging sources of LLMs' training data, it could explicitly or implicitly include test data, leading to LLMs being more susceptible to data contamination. However, due to the opacity of training data, the black-box access of models, and the rapid growth of synthetic training data, detecting and mitigating data contamination for LLMs faces significant challenges. In this paper, we propose CDD, which stands for Contamination Detection via output Distribution for LLMs. CDD necessitates only the sampled texts to detect data contamination, by identifying the peakedness of LLM's output distribution. To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output…
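The idea of detecting contamination from sampled texts alone, by checking how peaked the output distribution is, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, whitespace tokenization, the use of a single reference output (assumed here to be a greedy decode of the same prompt), and the `alpha` and `threshold` values are all illustrative assumptions. The sketch measures the fraction of sampled outputs that fall within a small token-level edit distance of the reference; a high fraction suggests the model keeps reproducing nearly the same text.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def peakedness(reference: str, samples: Sequence[str], alpha: float = 0.05) -> float:
    """Fraction of samples within alpha * max_len token edits of the reference.

    `reference` is assumed to be a deterministic (e.g. greedy) output for the
    prompt; `samples` are outputs drawn with sampling for the same prompt.
    Whitespace tokenization and the default alpha are illustrative choices.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    ref_tokens = reference.split()
    sample_tokens = [s.split() for s in samples]
    max_len = max(len(t) for t in sample_tokens)
    radius = alpha * max_len
    close = sum(1 for t in sample_tokens if edit_distance(ref_tokens, t) <= radius)
    return close / len(samples)


def looks_contaminated(reference: str, samples: Sequence[str],
                       alpha: float = 0.05, threshold: float = 0.5) -> bool:
    """Flag a prompt when the peakedness exceeds a hypothetical threshold."""
    return peakedness(reference, samples, alpha) > threshold
```

In use, one would collect a reference output and a batch of sampled outputs per benchmark prompt, then call `looks_contaminated(reference, samples)`. Both `alpha` and `threshold` would need tuning against known contaminated and clean cases before the flag means anything.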

Code & Models

Repositories

yihongdong/cdd-ted4llms (Official)

Taxonomy

Topics: Explainable Artificial Intelligence (XAI)