Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
Jiasheng Zheng, Boxi Cao, Zhengzhao Ma, Ruotong Pan, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun

TL;DR
This paper introduces RACE, a comprehensive benchmark evaluating large language models' code generation across multiple dimensions beyond correctness, revealing their strengths and weaknesses in real-world coding scenarios.
Contribution
The paper proposes RACE, a multidimensional benchmark for code quality assessment, addressing limitations of correctness-only evaluations and enhancing understanding of LLMs' real-world coding capabilities.
Findings
Current benchmarks focus mainly on correctness, missing other quality aspects.
RACE effectively evaluates models across multiple code quality dimensions.
Even advanced LLMs struggle with complex, customized coding requirements.
Abstract
In recent years, researchers have proposed numerous benchmarks to evaluate the impressive coding capabilities of large language models (LLMs). However, current benchmarks primarily assess the accuracy of LLM-generated code, while neglecting other critical dimensions that also significantly impact code quality in real-world development. Moreover, relying exclusively on correctness as the guiding metric renders LLMs susceptible to data contamination. Therefore, this paper proposes the RACE benchmark, which comprehensively evaluates the quality of code generated by LLMs across 4 dimensions: Readability, mAintainability, Correctness, and Efficiency. Specifically, considering the demand-dependent nature of dimensions beyond correctness, we design various types of user requirements for each dimension to assess the model's ability to generate correct code that also meets user demands. We…
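To make the idea of "correct code that also meets user demands" concrete, here is a minimal sketch in Python. It is not the RACE implementation: the snake_case naming rule, the function names, and the toy test cases are all assumptions chosen for illustration. It shows only the general pattern of scoring a sample on correctness and on one user requirement separately, then requiring both.

```python
import ast
import re

# Hypothetical requirement for illustration: function names must be snake_case.
SNAKE_CASE = re.compile(r"^[a-z_][a-z0-9_]*$")


def is_correct(source, func_name, cases):
    """Run the candidate code and compare its outputs with the expected ones."""
    namespace = {}
    try:
        # Toy input only; real model output should be run in a sandbox.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


def meets_naming_requirement(source):
    """Check the assumed readability requirement on every function definition."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    names = [node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)]
    return all(SNAKE_CASE.match(name) is not None for name in names)


def evaluate(source, func_name, cases):
    """Report correctness and requirement compliance, plus their conjunction."""
    correct = is_correct(source, func_name, cases)
    compliant = meets_naming_requirement(source)
    return {
        "correct": correct,
        "meets_requirement": compliant,
        "both": correct and compliant,
    }


# Two functionally identical candidates that differ only in naming style.
good = "def add_two(a, b):\n    return a + b\n"
bad = "def addTwo(a, b):\n    return a + b\n"
cases = [((1, 2), 3), ((0, 0), 0)]

good_result = evaluate(good, "add_two", cases)
bad_result = evaluate(bad, "addTwo", cases)
```

In this sketch both candidates pass the test cases, but only the first satisfies the assumed naming requirement, so a correctness-only metric would not distinguish them.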
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Focus
