On Leakage of Code Generation Evaluation Datasets

Alexandre Matton; Tom Sherborne; Dennis Aumiller; Elena Tommasone; Milad Alizadeh; Jingyi He; Raymond Ma; Maxime Voisin; Ellen Gilsenan-McMahon; Matthias Gallé

arXiv:2407.07565 · cs.CL · October 4, 2024

Open Access

TL;DR

This paper investigates various forms of data leakage in code generation evaluation datasets, demonstrates their impact, and introduces a new uncontaminated benchmark called LBPP to improve evaluation reliability.

Contribution

It identifies three sources of dataset contamination in code generation evaluation and releases LBPP, a clean benchmark dataset for more accurate assessment.

Findings

01. Evidence of data leakage in existing datasets

02. Synthetic data can cause indirect leakage

03. Overfitting to evaluation sets affects model performance

Abstract

In this paper, we consider contamination by code generation test sets, in particular through their use in modern large language models. We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data, and (iii) overfitting to evaluation sets during model selection. To address this, we release Less Basic Python Problems (LBPP): an uncontaminated new benchmark of 161 prompts with their associated Python solutions. LBPP is released at https://huggingface.co/datasets/CohereForAI/lbpp.
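As a rough illustration of what checking for direct data leakage (source (i) above) can look like, the sketch below measures how many token n-grams of a benchmark item also appear in a training document. This is not the authors' method: the helper names `token_ngrams` and `overlap_ratio`, the whitespace tokenisation, and the default `n=5` are all assumptions chosen for illustration.

```python
def token_ngrams(text: str, n: int = 5) -> set:
    """Return the set of whitespace-token n-grams in `text` (hypothetical helper)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 5) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus document.

    Assumption: a high ratio hints at verbatim leakage; any decision threshold
    is a judgment call and is not taken from the paper.
    """
    item_ngrams = token_ngrams(benchmark_item, n)
    if not item_ngrams:
        # Item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    doc_ngrams = token_ngrams(corpus_doc, n)
    return len(item_ngrams & doc_ngrams) / len(item_ngrams)
```

A check of this kind can only flag verbatim or near-verbatim copies. It would not detect the indirect leakage through synthetic data or the overfitting during model selection that the abstract also describes.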

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Reliability and Analysis Research · Advanced Malware Detection Techniques