On Leakage of Code Generation Evaluation Datasets
Alexandre Matton, Tom Sherborne, Dennis Aumiller, Elena Tommasone, Milad Alizadeh, Jingyi He, Raymond Ma, Maxime Voisin, Ellen Gilsenan-McMahon, Matthias Gallé

TL;DR
This paper investigates various forms of data leakage in code generation evaluation datasets, demonstrates their impact, and introduces a new uncontaminated benchmark called LBPP to improve evaluation reliability.
Contribution
The paper identifies three sources of contamination in code generation evaluation and releases LBPP, an uncontaminated benchmark for more reliable assessment.
Findings
Evidence of data leakage in existing datasets
Synthetic data can cause indirect leakage
Overfitting to evaluation sets affects model performance
Abstract
In this paper, we consider contamination by code generation test sets, in particular in their use in modern large language models. We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data and (iii) overfitting to evaluation sets during model selection. To address this, we release Less Basic Python Problems (LBPP): an uncontaminated new benchmark of 161 prompts with their associated Python solutions. LBPP is released at https://huggingface.co/datasets/CohereForAI/lbpp .
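To make the notion of direct data leakage in (i) concrete, the following is a minimal sketch of a word-level n-gram overlap check between a benchmark prompt and a candidate training document. It is a common heuristic for spotting verbatim contamination, not the procedure used in the paper; the function names `ngrams` and `overlap_ratio`, the whitespace tokenization, and the default `n=8` are all illustrative assumptions.

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in text (whitespace tokenization, an assumption)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text, corpus_text, n=8):
    """Fraction of the benchmark's n-grams that also appear in the corpus text.

    A value near 1.0 suggests the benchmark item appears almost verbatim in
    the corpus; 0.0 means no shared n-gram of length n. Texts shorter than
    n tokens yield 0.0 because they contain no n-grams.
    """
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(corpus_text, n)) / len(bench)


if __name__ == "__main__":
    # Hypothetical prompt and web page, for illustration only.
    prompt = "Write a function that returns the sum of two integers a and b"
    page = "some page text " + prompt + " and more"
    print(overlap_ratio(prompt, page, n=5))
```

A check like this can only flag verbatim or near-verbatim copies. It would not catch the indirect leakage through synthetic data described in (ii), where benchmark content may be paraphrased rather than copied.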
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Reliability and Analysis Research · Advanced Malware Detection Techniques
