Training on the Benchmark Is Not All You Need

Shiwen Ni; Xiangtao Kong; Chengming Li; Xiping Hu; Ruifeng Xu; Jia Zhu; Min Yang

arXiv:2409.01790 · cs.CL · March 3, 2025

TL;DR

This paper introduces a simple, effective method to detect data leakage in large language models' pre-training datasets by analyzing model responses to shuffled multiple-choice options, revealing significant leakage in popular open-source LLMs.

Contribution

The paper proposes a novel data leakage detection technique based on log probability analysis of shuffled options, applicable under gray-box conditions, and evaluates leakage levels across multiple models and benchmarks.

Findings

01. Effective detection of data leakage in LLMs' pre-training data.

02. Qwen LLMs exhibit the highest data leakage among tested models.

03. The method works under gray-box conditions: it uses the model's log probabilities and needs no access to training data or model weights.

Abstract

The success of Large Language Models (LLMs) relies heavily on the huge amount of pre-training data learned in the pre-training phase. The opacity of the pre-training process and the training data causes the results of many benchmark tests to become unreliable. If any model has been trained on a benchmark test set, it can seriously hinder the health of the field. In order to automate and efficiently test the capabilities of large language models, numerous mainstream benchmarks adopt a multiple-choice format. As the swapping of the contents of multiple-choice options does not affect the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options in the data to generate the corresponding derived data sets, and then detect data leakage based on the model's log probability…
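The shuffling step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `derived_prompts` and `looks_leaked`, the prompt layout, and the `log_prob` callable (standing in for whatever gray-box interface returns a sequence log probability) are all hypothetical. The abstract above is truncated before it states the decision rule, so the rule used here (flag an item when the original ordering scores strictly higher than every shuffled ordering) is an assumption made for the example.

```python
from itertools import permutations
from typing import Callable, Sequence


def derived_prompts(question: str, options: Sequence[str]) -> list[str]:
    """Build one prompt per ordering of the option contents.

    The labels A, B, C, ... stay in place; only the option texts move.
    itertools.permutations yields the identity ordering first, so
    element 0 of the result is the original benchmark item.
    """
    labels = "ABCDEFGH"
    prompts = []
    for order in permutations(options):
        lines = [question]
        lines += [f"{labels[i]}. {text}" for i, text in enumerate(order)]
        prompts.append("\n".join(lines))
    return prompts


def looks_leaked(
    question: str,
    options: Sequence[str],
    log_prob: Callable[[str], float],  # hypothetical gray-box scorer
) -> bool:
    """Assumed rule: the original ordering is the unique maximum."""
    scores = [log_prob(p) for p in derived_prompts(question, options)]
    original = scores[0]
    return original == max(scores) and scores.count(original) == 1


if __name__ == "__main__":
    q = "2 + 2 = ?"
    opts = ["3", "4", "5", "6"]
    memorized = derived_prompts(q, opts)[0]

    # Toy scorer that mimics a model that has memorized the original item.
    def toy_scorer(prompt: str) -> float:
        return 0.0 if prompt == memorized else -5.0

    print(len(derived_prompts(q, opts)))  # 4! = 24 orderings
    print(looks_leaked(q, opts, toy_scorer))
```

For an item with n options this scores n! prompts, so four options mean 24 scoring calls per item. In practice `log_prob` would be replaced by a call that sums the per-token log probabilities the model reports for the prompt.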

Code & Models

Repositories

nishiwen1214/Benchmark-leakage-detection
PyTorch · Official

Models

OrionStarAI/Orion-MoE8x7B
Hugging Face model · 10 downloads · 2 likes

Videos

Training on the Benchmark Is Not All You Need · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy

Methods: Sparse Evolutionary Training