Re-Evaluating Code LLM Benchmarks Under Semantic Mutation
Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang

TL;DR
This paper investigates how small changes in prompt wording can significantly affect the evaluation results of large language models on code tasks, revealing issues with current benchmark reliability.
Contribution
It introduces a framework for analyzing prompt sensitivity in code benchmarks and provides extensive empirical evidence across multiple tasks and models.
Findings
Prompt variations cause significant performance shifts.
Model rankings are inconsistent across different prompts.
Prompt sensitivity undermines benchmark reliability.
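The first two findings can be illustrated with a minimal sketch. It assumes each model has been scored on the same task under several prompt variants, then measures how far each model's score moves across variants and how well the model rankings from two variants agree (Kendall's tau). The model names, prompt labels, and scores are hypothetical placeholders, not results from the paper, and the metrics here are not necessarily the ones the authors use.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical pass rates, scores[model][prompt_variant]; NOT data from the paper.
scores = {
    "model_a": {"p0": 0.62, "p1": 0.55, "p2": 0.60},
    "model_b": {"p0": 0.58, "p1": 0.59, "p2": 0.52},
    "model_c": {"p0": 0.40, "p1": 0.45, "p2": 0.41},
}

def score_spread(per_prompt):
    """Summarize how much one model's score varies across prompt variants."""
    values = list(per_prompt.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "range": max(values) - min(values),
    }

def ranking(all_scores, prompt):
    """Order models from best to worst under a single prompt variant."""
    return sorted(all_scores, key=lambda m: all_scores[m][prompt], reverse=True)

def kendall_tau(rank_x, rank_y):
    """Kendall's tau for two tie-free rankings of the same models.

    1.0 means identical order, -1.0 means fully reversed order.
    """
    pos_x = {m: i for i, m in enumerate(rank_x)}
    pos_y = {m: i for i, m in enumerate(rank_y)}
    concordant = discordant = 0
    for a, b in combinations(rank_x, 2):
        sign = (pos_x[a] - pos_x[b]) * (pos_y[a] - pos_y[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_x) * (len(rank_x) - 1) // 2
    return (concordant - discordant) / n_pairs

if __name__ == "__main__":
    for model, per_prompt in scores.items():
        print(model, score_spread(per_prompt))
    for p, q in combinations(["p0", "p1", "p2"], 2):
        print(p, q, kendall_tau(ranking(scores, p), ranking(scores, q)))
```

In this toy data, model_a and model_b swap places between variants p0 and p1, so the tau for that pair falls below 1 even though each individual score moves by only a few points. A benchmark that reports a single prompt template would show just one of those orderings.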
Abstract
In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related tasks, such as code understanding and generation. A critical step in constructing code benchmarks is the design of prompts. However, as existing code benchmarks typically rely on a single prompt template per task, they are prone to the issue of prompt sensitivity, where minor prompt variations could result in substantial performance variations, leading to unreliable evaluations of model capabilities. While previous studies have explored prompt sensitivity, their experimental designs and findings are limited to traditional natural language processing (NLP) tasks. In this paper, we present an empirical study to investigate prompt sensitivity in code…
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Model-Driven Software Engineering Techniques
