CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?
Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li, Jing Ma

TL;DR
This paper introduces CodeJudge-Eval, a new benchmark that evaluates large language models' code understanding by asking them to judge whether provided code solutions are correct. It finds that even top models show significant limitations on this task.
Contribution
The paper presents a novel benchmark, CJ-Eval, focusing on code judging rather than generation, and evaluates 12 models to highlight their shortcomings in code understanding.
Findings
State-of-the-art models struggle with code judging tasks.
CJ-Eval captures deeper code understanding abilities.
Traditional benchmarks may overestimate models' capabilities.
Abstract
Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe…
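The judging setup described in the abstract can be illustrated with a minimal scoring sketch: a model assigns a verdict to each candidate solution, and the verdicts are compared against reference verdicts. This is not the paper's evaluation code. The verdict labels and the function names `judge_accuracy` and `per_verdict_recall` are assumptions made for illustration, and the benchmark's actual label set and metrics may differ.

```python
from collections import Counter

# Hypothetical verdict labels, chosen for illustration only;
# the benchmark's real label set may differ.
VERDICTS = ("accepted", "wrong_answer", "compile_error")


def judge_accuracy(predicted, expected):
    """Fraction of solutions whose predicted verdict matches the reference."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def per_verdict_recall(predicted, expected):
    """For each reference verdict, the share of cases the judge got right."""
    totals = Counter(expected)
    hits = Counter(e for p, e in zip(predicted, expected) if p == e)
    return {verdict: hits[verdict] / totals[verdict] for verdict in totals}


# Made-up example data: the judge mistakes one wrong answer for a correct one.
expected = ["accepted", "wrong_answer", "compile_error", "wrong_answer"]
predicted = ["accepted", "accepted", "compile_error", "wrong_answer"]

accuracy = judge_accuracy(predicted, expected)
recall = per_verdict_recall(predicted, expected)
```

Breaking the score down by verdict shows whether a judge is biased toward one label, such as accepting incorrect solutions, which a single accuracy figure can hide.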
Taxonomy
Topics: Artificial Intelligence in Law · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
