The Fault in our Stars: Quality Assessment of Code Generation Benchmarks

Mohammed Latif Siddiq; Simantika Dristi; Joy Saha; Joanna C. S. Santos

arXiv:2404.10155 · cs.SE · September 5, 2024 · 3 cites

Open Access

TL;DR

This study critically evaluates the quality of prompts in code generation benchmarks, revealing prevalent issues and demonstrating that prompt improvements can enhance Python code generation performance, while also uncovering potential data contamination in models.

Contribution

It is the first comprehensive analysis of prompt quality in code generation benchmarks, highlighting common issues and their impact on model evaluation.

Findings

01. Benchmark prompts often contain spelling and grammatical errors.

02. Fixing prompt issues improves Python code generation performance.

03. The study finds evidence of data contamination in the GPT-3.5-Turbo and CodeGen-2.5 models.

Abstract

Large Language Models (LLMs) are gaining popularity among software engineers. A crucial aspect of developing effective code generation LLMs is to evaluate these models using a robust benchmark. Evaluation benchmarks with quality issues can provide a false sense of performance. In this work, we conduct the first-of-its-kind study of the quality of prompts within benchmarks used to compare the performance of different code generation models. To conduct this study, we analyzed 3,566 prompts from 9 code generation benchmarks to identify quality issues in them. We also investigated whether fixing the identified quality issues in the benchmarks' prompts affects a model's performance. We also studied memorization issues of the evaluation dataset, which can put into question a benchmark's trustworthiness. We found that code generation evaluation benchmarks mainly focused on Python and coding…

Taxonomy

Topics: Model-Driven Software Engineering Techniques · Real-time simulation and control systems · Software Testing and Debugging Techniques

Methods: Dropout · Adam · Attention Is All You Need · Linear Layer · Layer Normalization · Weight Decay · Byte Pair Encoding · Multi-Head Attention