LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks

Xin Zhou; Martin Weyssow; Ratnadira Widyasari; Ting Zhang; Junda He; Yunbo Lyu; Jianming Chang; Beiqi Zhang; Dan Huang; David Lo

arXiv:2502.06215·cs.SE·February 11, 2025·2 cites

Open Access

TL;DR

This study investigates data leakage in 83 software engineering benchmarks used for evaluating large language models, revealing generally low leakage but highlighting specific benchmarks with high bias, and introduces LessLeak-Bench to improve evaluation reliability.

Contribution

First large-scale analysis of data leakage in SE benchmarks for LLMs, identifying causes, impact, and proposing a new benchmark to mitigate leakage effects.

Findings

01

Average leakage ratios are low across most benchmarks.

02

Some benchmarks, such as QuixBugs, have 100% leakage.

03

Removing leaked samples improves evaluation reliability.

Abstract

Large Language Models (LLMs) are widely utilized in software engineering (SE) tasks, such as code generation and automated program repair. However, their reliance on extensive and often undisclosed pre-training datasets raises significant concerns about data leakage, where the evaluation benchmark data is unintentionally "seen" by LLMs during the model's construction phase. The data leakage issue could largely undermine the validity of LLM-based research and evaluations. Despite the increasing use of LLMs in the SE community, there is no comprehensive study that assesses the extent of data leakage in SE benchmarks for LLMs yet. To address this gap, this paper presents the first large-scale analysis of data leakage in 83 SE benchmarks concerning LLMs. Our results show that in general, data leakage in SE benchmarks is minimal, with average leakage ratios of only 4.8%, 2.8%, and 0.7%…
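
As a rough illustration of the quantity the abstract reports, a leakage ratio can be read as the share of a benchmark's samples that have been flagged as seen during pre-training. The sketch below assumes each sample already carries such a flag; how the paper actually detects leaked samples is not described in this summary, so the `Sample` type, the `leaked` field, and both helper functions are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Hypothetical benchmark sample; `leaked` is assumed to be set by some
    external detection step that is not specified here."""
    sample_id: str
    leaked: bool


def leakage_ratio(samples: List[Sample]) -> float:
    """Fraction of samples flagged as leaked (0.0 for an empty benchmark)."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s.leaked) / len(samples)


def drop_leaked(samples: List[Sample]) -> List[Sample]:
    """Keep only samples not flagged as leaked, as a cleaned evaluation set."""
    return [s for s in samples if not s.leaked]
```

Under this reading, a benchmark in which every sample is flagged has a ratio of 1.0 (the 100% case mentioned for QuixBugs in the findings), and evaluating on the output of `drop_leaked` corresponds to the idea of removing leaked samples before scoring a model.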

Taxonomy

Topics: Cloud Data Security Solutions · Data Quality and Management · Digital and Cyber Forensics