CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian, Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang,, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao, Huang, Zhaoxiang Zhang

TL;DR
CodeCriticBench is a comprehensive benchmark designed to evaluate the critique abilities of large language models across diverse code-related tasks and multiple evaluation dimensions.
Contribution
It introduces a holistic evaluation framework for LLM critique capacity, covering multiple code tasks and detailed assessment protocols, addressing limitations of prior benchmarks.
Findings
Existing benchmarks lack comprehensive code critique evaluation.
CodeCriticBench effectively evaluates critique abilities across tasks and dimensions.
Experimental results demonstrate the benchmark's effectiveness.
Abstract
The critique capacity of Large Language Models (LLMs) is essential for reasoning abilities, which can provide necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of LLMs has drawn great attention and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1). Focusing on diverse reasoning tasks in general domains and insufficient evaluation on code tasks (e.g., only covering code generation task), where the difficulty of queries is relatively easy (e.g., the code queries of CriticBench are from Humaneval and MBPP). (2). Lacking comprehensive evaluation from different dimensions. To address these limitations, we introduce a holistic code critique benchmark for LLMs called CodeCriticBench. Specifically, our CodeCriticBench includes two mainstream…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware Engineering Research · Software System Performance and Reliability · Topic Modeling
MethodsSoftmax · Attention Is All You Need
