CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks

Hongchao Jiang; Yiming Chen; Yushi Cao; Hung-yi Lee; Robby T. Tan

arXiv:2507.10535 · cs.CL · August 15, 2025

TL;DR

This paper introduces CodeJudgeBench, a benchmark for evaluating LLMs acting as judges in coding tasks, revealing that recent thinking models outperform others but still face challenges in consistency and reliability.

Contribution

The paper presents the first dedicated benchmark for LLM-as-a-Judge in coding, benchmarks 26 LLM-as-a-Judge models on it, and explores prompting strategies to improve judgment accuracy.

Findings

01. Thinking models outperform non-thinking models in code judging.

02. Small thinking models can outperform larger models trained specifically as judges.

03. Judgment accuracy is sensitive to response order and model variance.
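
The order sensitivity noted in the findings can be illustrated with a small position-swap check. The sketch below is not the paper's evaluation code. It assumes a hypothetical pairwise judge function that returns "A" or "B", and it uses a toy rule-based judge (`toy_judge`) in place of a real LLM call. All names (`order_sensitivity`, `toy_judge`, the example data) are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a judge takes (question, response_a, response_b)
# and returns "A" or "B" for the response it prefers.
Judge = Callable[[str, str, str], str]
# Each example is (question, correct_response, incorrect_response).
Example = Tuple[str, str, str]


def order_sensitivity(judge: Judge, examples: List[Example]) -> Dict[str, float]:
    """Score a pairwise judge with the correct response shown first, then second."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct_first = 0   # judge picks "A" when the correct response is in slot A
    correct_second = 0  # judge picks "B" when the correct response is in slot B
    consistent = 0      # judge picks the same underlying response in both orders
    for question, good, bad in examples:
        first_ok = judge(question, good, bad) == "A"
        second_ok = judge(question, bad, good) == "B"
        correct_first += first_ok
        correct_second += second_ok
        consistent += first_ok == second_ok
    n = len(examples)
    return {
        "acc_correct_first": correct_first / n,
        "acc_correct_second": correct_second / n,
        "consistency": consistent / n,
    }


def toy_judge(question: str, response_a: str, response_b: str) -> str:
    """Stand-in for an LLM call (assumption): prefers a response containing
    'return' and falls back to slot A when it cannot tell the two apart."""
    a_ok = "return" in response_a
    b_ok = "return" in response_b
    if b_ok and not a_ok:
        return "B"
    return "A"


examples = [
    ("Add two numbers.", "def add(a, b): return a + b", "def add(a, b): pass"),
    ("Negate x.", "def neg(x): return -x", "def neg(x): return x"),
]
report = order_sensitivity(toy_judge, examples)
print(report)
```

A gap between the two accuracies, or a consistency rate below 1.0, indicates that the judge's verdict depends on presentation order rather than only on response quality. With a real model, `toy_judge` would be replaced by a function that prompts the LLM and parses its verdict.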

Abstract

Large Language Models (LLMs) have significantly advanced the state-of-the-art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models. Such an evaluation capability is crucial both for benchmarking different LLMs and for improving response quality through response ranking. However, despite the growing adoption of the LLM-as-a-Judge paradigm, its effectiveness in coding scenarios remains underexplored due to the absence of dedicated benchmarks. To address this gap, we introduce CodeJudgeBench, a benchmark explicitly designed to evaluate the performance of LLM-as-a-Judge models across three critical coding tasks: code generation, code repair, and unit test generation. Through comprehensive benchmarking of 26 LLM-as-a-Judge models, we find that recent thinking models…
