CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

Alexander Zhang; Marcus Dong; Jiaheng Liu; Wei Zhang; Yejie Wang; Jian Yang; Ge Zhang; Tianyu Liu; Zhongyuan Peng; Yingshui Tan; Yuanxing Zhang; Zhexu Wang; Weixun Wang; Yancheng He; Ken Deng; Wangchunshu Zhou; Wenhao Huang; Zhaoxiang Zhang

arXiv:2502.16614·cs.CL·February 25, 2025

TL;DR

CodeCriticBench is a comprehensive benchmark designed to evaluate the critique abilities of large language models across diverse code-related tasks and multiple evaluation dimensions.

Contribution

The paper introduces a holistic evaluation framework for LLM critique capacity that covers multiple code tasks and detailed assessment protocols, addressing limitations of prior benchmarks.

Findings

01. Existing benchmarks lack comprehensive code critique evaluation.

02. CodeCriticBench effectively evaluates critique abilities across tasks and dimensions.

03. Experimental results demonstrate the benchmark's effectiveness.

Abstract

The critique capacity of Large Language Models (LLMs) is essential to their reasoning abilities, because it lets them provide necessary suggestions (e.g., detailed analysis and constructive feedback). How to evaluate this capacity has therefore drawn great attention, and several critique benchmarks have been proposed. However, existing critique benchmarks usually have two limitations: (1) they focus on diverse reasoning tasks in general domains and evaluate code tasks insufficiently (e.g., covering only the code generation task), with relatively easy queries (e.g., the code queries of CriticBench come from HumanEval and MBPP); (2) they lack comprehensive evaluation across different dimensions. To address these limitations, we introduce a holistic code critique benchmark for LLMs called CodeCriticBench. Specifically, our CodeCriticBench includes two mainstream…

Code & Models

Repositories

multimodal-art-projection/CodeCriticBench

Datasets

m-a-p/CodeCriticBench
dataset · 65 dl

Taxonomy

Topics: Software Engineering Research · Software System Performance and Reliability · Topic Modeling

Methods: Softmax · Attention Is All You Need