The Fault in our Stars: Quality Assessment of Code Generation Benchmarks
Mohammed Latif Siddiq, Simantika Dristi, Joy Saha, Joanna C. S. Santos

TL;DR
This study critically evaluates the quality of prompts in code generation benchmarks, revealing prevalent issues and demonstrating that prompt improvements can enhance Python code generation performance, while also uncovering potential data contamination in models.
Contribution
The paper presents the first comprehensive analysis of prompt quality in code generation benchmarks, highlighting common issues and their impact on model evaluation.
Findings
Benchmark prompts often contain spelling and grammatical errors.
Fixing prompt issues improves Python code generation performance.
GPT-3.5-Turbo and CodeGen-2.5 show evidence of data contamination.
Abstract
Large Language Models (LLMs) are gaining popularity among software engineers. A crucial aspect of developing effective code generation LLMs is to evaluate these models using a robust benchmark. Evaluation benchmarks with quality issues can provide a false sense of performance. In this work, we conduct the first-of-its-kind study of the quality of prompts within benchmarks used to compare the performance of different code generation models. To conduct this study, we analyzed 3,566 prompts from 9 code generation benchmarks to identify quality issues in them. We also investigated whether fixing the identified quality issues in the benchmarks' prompts affects a model's performance. We also studied memorization issues of the evaluation dataset, which can put into question a benchmark's trustworthiness. We found that code generation evaluation benchmarks mainly focused on Python and coding…
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Real-time simulation and control systems · Software Testing and Debugging Techniques
Methods: Dropout · Adam · Attention Is All You Need · Linear Layer · Layer Normalization · Weight Decay · Byte Pair Encoding · Multi-Head Attention
