CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

Yuwei Zhao; Ziyang Luo; Yuchen Tian; Hongzhan Lin; Weixiang Yan; Annan Li; Jing Ma

arXiv:2408.10718·cs.SE·September 16, 2024


TL;DR

This paper introduces CodeJudge-Eval, a new benchmark to evaluate large language models' understanding of code by judging correctness, revealing that even top models have significant limitations in this aspect.

Contribution

The paper presents a novel benchmark, CJ-Eval, focusing on code judging rather than generation, and evaluates 12 models to highlight their shortcomings in code understanding.

Findings

01 State-of-the-art models struggle with code judging tasks.

02 CJ-Eval captures deeper code understanding abilities.

03 Traditional benchmarks may overestimate models' capabilities.

Abstract

Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe…
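The judging setup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the four verdict labels, the `solve(x)` convention for candidate solutions, the helper names, and the idea that reference verdicts come from running each candidate against test cases. The abstract does not specify the benchmark's actual label set or data pipeline.

```python
from typing import Callable, List, Tuple

# Hypothetical verdict labels; the benchmark's real label set may differ.
ACCEPTED = "Accepted"
WRONG_ANSWER = "Wrong Answer"
RUNTIME_ERROR = "Runtime Error"
COMPILE_ERROR = "Compile Error"

Tests = List[Tuple[int, int]]  # (input, expected output) pairs


def reference_verdict(source: str, tests: Tests) -> str:
    """Compile and run `source` (which must define solve(x)) to get a verdict."""
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return COMPILE_ERROR
    namespace: dict = {}
    try:
        exec(code, namespace)  # toy inputs only; there is no sandboxing here
        solve = namespace["solve"]
    except Exception:
        return RUNTIME_ERROR
    for arg, expected in tests:
        try:
            if solve(arg) != expected:
                return WRONG_ANSWER
        except Exception:
            return RUNTIME_ERROR
    return ACCEPTED


def judging_accuracy(judge: Callable[[str], str],
                     items: List[Tuple[str, Tests]]) -> float:
    """Fraction of items where the judge's verdict matches the reference verdict."""
    hits = sum(judge(src) == reference_verdict(src, tests) for src, tests in items)
    return hits / len(items)


# Four toy candidates for the task "return twice the input".
items = [
    ("def solve(x):\n    return x * 2", [(1, 2), (3, 6)]),
    ("def solve(x):\n    return x + 2", [(2, 4), (3, 6)]),
    ("def solve(x):\n    return x // 0", [(1, 2)]),
    ("def solve(x) return x", [(1, 2)]),
]


def always_accept(source: str) -> str:
    """A trivial stand-in for a model that judges every solution as correct."""
    return ACCEPTED
```

In this sketch `always_accept` stands in for the model under evaluation, which would read the problem and the code and predict a verdict without executing anything. The third and fourth candidates show why a fine-grained label set is harder to game than a binary correct/incorrect decision.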


Code & Models

Repositories

codellm-research/codejudge-eval (official)

Datasets

CodeResearch/CodeJudge-Eval
dataset · 281 downloads


Taxonomy

Topics: Artificial Intelligence in Law · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training