Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation

Qin Zhu; Qingyuan Cheng; Runyu Peng; Xiaonan Li; Tengxiao Liu; Ru Peng; Xipeng Qiu; Xuanjing Huang

arXiv:2406.13990 · cs.CL · June 25, 2024 · 1 citation

TL;DR

This paper introduces Inference-Time Decontamination (ITD), a method to detect and rewrite leaked benchmark samples during evaluation, reducing performance inflation and providing more accurate assessments of large language models.

Contribution

The paper proposes ITD, a novel approach to mitigate benchmark leakage effects during LLM evaluation without needing new benchmarks.

Findings

1. ITD reduces inflated accuracy by 22.9% on GSM8K.
2. ITD decreases the MMLU results of Phi3 and Mistral by 6.7% and 3.6%, respectively.
3. ITD offers more truthful evaluation results for LLMs.

Abstract

The training process of large language models (LLMs) often involves varying degrees of test data contamination. Although current LLMs achieve increasingly strong results on various benchmarks, their performance in practical applications does not always match those results. Benchmark leakage can prevent an accurate assessment of an LLM's true performance. However, constructing new benchmarks is costly and labor-intensive, and new benchmarks still carry the risk of leakage. Therefore, in this paper, we ask: can we reuse these leaked benchmarks for LLM evaluation? We propose Inference-Time Decontamination (ITD) to address this issue by detecting and rewriting leaked samples without altering their difficulty. ITD can mitigate the performance inflation caused by memorizing leaked benchmarks. Our proof-of-concept experiments demonstrate that ITD reduces inflated accuracy by…
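The detect-then-rewrite idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the names `Sample`, `decontaminate`, and `accuracy`, the membership-based toy detector, the prefix-based toy rewriter, and the toy "memorizing" model are all hypothetical. The sketch only shows the control flow: flagged samples get a rewritten question, the gold answer stays unchanged, and the model is then scored on the rewritten set.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    question: str
    answer: str


def decontaminate(
    samples: List[Sample],
    is_leaked: Callable[[Sample], bool],
    rewrite: Callable[[str], str],
) -> List[Sample]:
    """Rewrite only the samples flagged as leaked; gold answers are kept."""
    return [
        Sample(rewrite(s.question), s.answer) if is_leaked(s) else s
        for s in samples
    ]


def accuracy(samples: List[Sample], model: Callable[[str], str]) -> float:
    """Exact-match accuracy of `model` over `samples`."""
    if not samples:
        return 0.0
    return sum(model(s.question) == s.answer for s in samples) / len(samples)


# Hypothetical toy stand-ins, not the paper's detector, rewriter, or models.
MEMORIZED = {"What is 2 + 3?": "5", "What is 7 * 6?": "42"}


def memorizing_model(question: str) -> str:
    # Answers only questions it has seen verbatim (pure memorization).
    return MEMORIZED.get(question, "unknown")


def toy_is_leaked(sample: Sample) -> bool:
    # Placeholder detector: flags a sample if its question is in the memorized set.
    return sample.question in MEMORIZED


def toy_rewrite(question: str) -> str:
    # Placeholder rewriter: changes the surface form, keeps the task itself.
    return "Rephrased: " + question


benchmark = [Sample(q, a) for q, a in MEMORIZED.items()]
before = accuracy(benchmark, memorizing_model)
after = accuracy(
    decontaminate(benchmark, toy_is_leaked, toy_rewrite), memorizing_model
)
print(before, after)  # prints: 1.0 0.0
```

The toy model exaggerates the effect on purpose: it has no real ability, so its score collapses once the surface form changes. For a model with genuine ability, a rewrite that preserves difficulty (as the abstract requires) should remove only the part of the score that comes from memorization.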


Code & Models

Repositories

8188zq/Inference-Time-Decontamination (official)

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)