Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models

Jiasheng Zheng; Boxi Cao; Zhengzhao Ma; Ruotong Pan; Hongyu Lin; Yaojie Lu; Xianpei Han; Le Sun

arXiv:2407.11470·cs.SE·October 10, 2024

TL;DR

This paper introduces RACE, a comprehensive benchmark evaluating large language models' code generation across multiple dimensions beyond correctness, revealing their strengths and weaknesses in real-world coding scenarios.

Contribution

The paper proposes RACE, a multidimensional benchmark for code quality assessment, addressing limitations of correctness-only evaluations and enhancing understanding of LLMs' real-world coding capabilities.

Findings

01 Current benchmarks focus mainly on correctness, missing other quality aspects.

02 RACE effectively evaluates models across multiple code quality dimensions.

03 Even advanced LLMs struggle with complex, customized coding requirements.

Abstract

In recent years, researchers have proposed numerous benchmarks to evaluate the impressive coding capabilities of large language models (LLMs). However, current benchmarks primarily assess the accuracy of LLM-generated code, while neglecting other critical dimensions that also significantly impact code quality in real-world development. Moreover, relying exclusively on correctness as the guiding metric renders LLMs susceptible to data contamination. Therefore, this paper proposes the RACE benchmark, which comprehensively evaluates the quality of code generated by LLMs across 4 dimensions: Readability, mAintainability, Correctness, and Efficiency. Specifically, considering the demand-dependent nature of dimensions beyond correctness, we design various types of user requirements for each dimension to assess the model's ability to generate correct code that also meets user demands. We…
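To make the idea of "correct code that also meets user demands" concrete, the following is a minimal sketch of checking one generated snippet against both a correctness test and two customized readability requirements. It is an illustration only, not the RACE evaluation pipeline: the function names (`check_correctness`, `check_line_length`, `check_snake_case`, `evaluate`), the two example requirements (a maximum line length and snake_case function names), and the default limit of 60 characters are all assumptions introduced here.

```python
import ast
import re

# Hypothetical requirement: function names must be snake_case.
SNAKE_CASE = re.compile(r"^[a-z_][a-z0-9_]*$")


def check_correctness(source, func_name, cases):
    """Run the candidate function on (args, expected) pairs."""
    namespace = {}
    exec(source, namespace)  # acceptable only for trusted snippets
    func = namespace[func_name]
    return all(func(*args) == expected for args, expected in cases)


def check_line_length(source, max_len):
    """Hypothetical requirement: no line longer than max_len characters."""
    return all(len(line) <= max_len for line in source.splitlines())


def check_snake_case(source):
    """True if every function defined in the source has a snake_case name."""
    tree = ast.parse(source)
    return all(
        SNAKE_CASE.match(node.name) is not None
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    )


def evaluate(source, func_name, cases, max_len=60):
    """Report correctness and requirement compliance separately."""
    return {
        "correct": check_correctness(source, func_name, cases),
        "line_length_ok": check_line_length(source, max_len),
        "snake_case_ok": check_snake_case(source),
    }


# A candidate that passes its tests but ignores the naming requirement.
candidate = "def addNumbers(a, b):\n    return a + b\n"
report = evaluate(candidate, "addNumbers", [((1, 2), 3), ((0, 0), 0)])
print(report)
```

In this sketch the candidate is functionally correct yet fails the naming requirement, which is the kind of gap a correctness-only metric would not surface.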

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Focus