Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

Martin Riddell; Ansong Ni; Arman Cohan

arXiv:2403.04811 · cs.SE · March 11, 2024 · 1 citation

TL;DR

This paper investigates how data contamination from training datasets affects the evaluation of large language models in code generation, revealing significant overlaps and their impact on model performance.

Contribution

It provides a comprehensive analysis of benchmark contamination, quantifies overlaps with training data, and examines factors influencing model memorization and generalization.

Findings

01. Substantial overlap between benchmarks and training data.

02. Models perform better on contaminated benchmark subsets.

03. Factors like model size and question length affect memorization.
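The second finding compares performance on contaminated and uncontaminated subsets of a benchmark. A minimal sketch of such a comparison is shown here, assuming each problem already has an overlap score in [0, 1] and a pass/fail outcome. The `ProblemResult` record, the `overlap_score` field and the 0.8 threshold are hypothetical names and values chosen for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class ProblemResult:
    """One benchmark problem (hypothetical record layout for this sketch)."""
    task_id: str
    overlap_score: float  # assumed best-match similarity with training data, in [0, 1]
    passed: bool          # whether the model's solution passed the problem's tests


def pass_rate(results: Sequence[ProblemResult]) -> Optional[float]:
    """Fraction of problems solved, or None when the subset is empty."""
    if not results:
        return None
    return sum(r.passed for r in results) / len(results)


def split_pass_rates(
    results: Sequence[ProblemResult], threshold: float = 0.8
) -> Tuple[Optional[float], Optional[float]]:
    """Return (pass rate on contaminated subset, pass rate on the rest).

    A problem counts as contaminated when its overlap score reaches the
    threshold; the threshold value is an assumption, not a published setting.
    """
    contaminated = [r for r in results if r.overlap_score >= threshold]
    clean = [r for r in results if r.overlap_score < threshold]
    return pass_rate(contaminated), pass_rate(clean)
```

Calling `split_pass_rates(results)` on a list of `ProblemResult` records returns the two rates side by side, so a gap between them can be read off directly. Returning `None` for an empty subset avoids reporting a misleading 0% when no problem crosses the threshold.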

Abstract

While large language models have achieved remarkable performance on various code generation benchmarks, there have been growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and finetuning data. While recent work has investigated contamination in natural language generation and understanding tasks, there has been less extensive research into how data contamination impacts the evaluation of code generation, which is critical for understanding the robustness and reliability of LLMs in programming contexts. In this work, we perform a comprehensive study of data contamination of popular code generation benchmarks, and precisely quantify their overlap with the pretraining corpus through both surface-level and semantic-level matching. In our experiments, we show that there is substantial overlap between popular code generation benchmarks…
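The abstract mentions surface-level matching but this excerpt does not describe how it is computed. As a rough illustration only, one way to approximate a surface-level check is a normalized edit-distance similarity between a benchmark solution and every same-length window of a training document. The function names, the sliding-window strategy and the `stride` parameter below are assumptions made for this sketch, not the authors' implementation, and semantic-level matching is not covered.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_surface_match(solution: str, corpus_text: str, stride: int = 1) -> float:
    """Best similarity between `solution` and any same-length window of `corpus_text`.

    `stride` (an assumed knob) trades accuracy for speed by skipping start offsets.
    """
    window = len(solution)
    if window == 0 or len(corpus_text) < window:
        return similarity(solution, corpus_text)
    return max(
        similarity(solution, corpus_text[start:start + window])
        for start in range(0, len(corpus_text) - window + 1, stride)
    )
```

A score near 1.0 from `best_surface_match` suggests that the solution appears almost verbatim in the document. This brute-force scan is quadratic per window and is only practical for small inputs; a study at pretraining-corpus scale would need indexing or candidate filtering first.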

Code & Models

Repositories

yale-nlp/code-llm-contamination (official)

Videos

Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

Taxonomy

Topics: Natural Language Processing Techniques