LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks
Xin Zhou, Martin Weyssow, Ratnadira Widyasari, Ting Zhang, Junda He, Yunbo Lyu, Jianming Chang, Beiqi Zhang, Dan Huang, David Lo

TL;DR
This study investigates data leakage in 83 software engineering benchmarks used to evaluate large language models. It finds that leakage is generally low but that specific benchmarks are heavily affected, and it introduces LessLeak-Bench to improve evaluation reliability.
Contribution
First large-scale analysis of data leakage in SE benchmarks for LLMs, identifying its causes and impact and proposing a new benchmark to mitigate its effects.
Findings
Average leakage ratios are low across most benchmarks.
Some benchmarks, such as QuixBugs, have 100% leakage.
Removing leaked samples improves evaluation reliability.
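To make the notion of a leakage ratio concrete, here is a minimal sketch in Python. It assumes each benchmark sample already carries a flag saying whether it was found in a model's pre-training data. How that flag is determined is not described in this summary, so the `Sample` class, the `leaked` field, and both helper functions are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One benchmark item (hypothetical structure, not from the paper)."""
    sample_id: str
    leaked: bool  # assumed to come from some prior leakage-detection step


def leakage_ratio(samples):
    """Fraction of samples flagged as leaked; 0.0 for an empty benchmark."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s.leaked) / len(samples)


def remove_leaked(samples):
    """Return only the samples that were not flagged as leaked."""
    return [s for s in samples if not s.leaked]


# Toy benchmark: four samples, one of which is flagged as leaked.
toy_benchmark = [
    Sample("s1", leaked=False),
    Sample("s2", leaked=True),
    Sample("s3", leaked=False),
    Sample("s4", leaked=False),
]

ratio = leakage_ratio(toy_benchmark)
cleaned = remove_leaked(toy_benchmark)
print(f"leakage ratio: {ratio:.2%}")
print(f"samples kept after filtering: {len(cleaned)}")
```

Under this definition, a benchmark in which every sample is flagged has a ratio of 1.0 (100%), and filtering it would leave no samples to evaluate on.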
Abstract
Large Language Models (LLMs) are widely utilized in software engineering (SE) tasks, such as code generation and automated program repair. However, their reliance on extensive and often undisclosed pre-training datasets raises significant concerns about data leakage, where the evaluation benchmark data is unintentionally "seen" by LLMs during the model's construction phase. The data leakage issue could largely undermine the validity of LLM-based research and evaluations. Despite the increasing use of LLMs in the SE community, there is no comprehensive study that assesses the extent of data leakage in SE benchmarks for LLMs yet. To address this gap, this paper presents the first large-scale analysis of data leakage in 83 SE benchmarks concerning LLMs. Our results show that in general, data leakage in SE benchmarks is minimal, with average leakage ratios of only 4.8%, 2.8%, and 0.7%…
Taxonomy
Topics: Cloud Data Security Solutions · Data Quality and Management · Digital and Cyber Forensics
