Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks
Xing Hu, Feifei Niu, Junkai Chen, Xin Zhou, Junwei Zhang, Junda He, Xin Xia, David Lo

TL;DR
This paper reviews 291 benchmarks for evaluating large language models in software engineering, analyzing their construction, limitations, and future directions to improve assessment tools.
Contribution
It provides a comprehensive analysis of existing SE benchmarks for LLMs, highlighting their strengths, weaknesses, and future research opportunities.
Findings
The review covers 291 benchmarks for SE tasks involving LLMs.
Current benchmarks have notable limitations in scope and design.
Future benchmarks should address identified challenges for better evaluation.
Abstract
Large language models (LLMs) are gaining increasing popularity in software engineering (SE) due to their unprecedented performance across various applications. These models are increasingly being utilized for a range of SE tasks, including requirements engineering and design, code analysis and generation, software maintenance, and quality assurance. As LLMs become more integral to SE, evaluating their effectiveness is crucial for understanding their potential in this field. In recent years, substantial efforts have been made to assess LLM performance in various SE tasks, resulting in the creation of several benchmarks tailored to this purpose. This paper offers a thorough review of 291 benchmarks, addressing three main aspects: what benchmarks are available, how benchmarks are constructed, and the future outlook for these benchmarks. We begin by examining SE tasks such as requirements…
