Benchmarking and Revisiting Code Generation Assessment: A Mutation-Based Approach
Longtian Wang, Tianlin Li, Xiaofei Xie, Yuhan Zhi, Jian Wang, Chao Shen

TL;DR
This paper introduces mutation strategies to simulate real-world variations in problem descriptions for code generation, revealing significant discrepancies in model performance and highlighting the need for more robust benchmarks.
Contribution
It proposes 10 mutation strategies and three evaluation metrics to better assess code generation models under realistic, varied problem descriptions.
Findings
Model performance differs significantly between original benchmark prompts and their mutated variants.
Existing benchmarks may overestimate model robustness.
Mutations reveal model sensitivities to description variations.
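To make the idea of a description mutation concrete, here is a minimal sketch of one possible typo-style mutation. It is an illustration only: the function name `swap_typo_mutation`, the `rate` and `seed` parameters, and the adjacent-letter-swap rule are assumptions made for this example and are not taken from the paper, whose ten strategies are not detailed in this summary.

```python
import random


def swap_typo_mutation(prompt: str, rate: float = 0.1, seed: int = 0) -> str:
    """Hypothetical mutation: swap two adjacent inner letters in some words.

    This is an assumed, simplified stand-in for a typo mutation, not the
    paper's actual strategy.
    """
    rng = random.Random(seed)  # seeded so the same prompt always mutates the same way
    mutated = []
    for word in prompt.split(" "):
        # Only touch alphabetic words with at least two inner letters.
        if word.isalpha() and len(word) >= 4 and rng.random() < rate:
            i = rng.randrange(1, len(word) - 2)  # keeps first and last letters in place
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        mutated.append(word)
    return " ".join(mutated)


original = "Write a function that returns the sorted list"
print(swap_typo_mutation(original, rate=1.0, seed=42))
```

One way to use such a mutation would be to run the same model on both the original and the mutated prompt, then compare how often the generated code passes the benchmark's tests. A fixed seed keeps the mutated prompt set reproducible across runs.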
Abstract
Code Large Language Models (CLLMs) have exhibited outstanding performance in program synthesis, attracting the focus of the research community. The evaluation of CLLMs' program synthesis capability has generally relied on manually curated benchmarks. However, there is a substantial gap between real-world scenarios and benchmark settings. Existing benchmarks typically provide only a single input prompt for the evaluation of each synthesis problem. However, in practice, a problem can be described in various ways, including with typos, where developers may struggle to understand certain descriptions and seek clarification to find more suitable wording. Such various descriptions may lead to variations in the performance of CLLMs on the same question, resulting in a biased evaluation when using existing benchmarks. In this paper, we aim to explore these pitfalls with the goal of revisiting…
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Embedded Systems Design Techniques
