JudgeBench: A Benchmark for Evaluating LLM-based Judges
Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, Ion Stoica

TL;DR
JudgeBench introduces a new benchmark for evaluating LLM-based judges on complex tasks, revealing that many models perform barely better than random, thus highlighting the need for more reliable evaluation methods.
Contribution
The paper proposes a novel evaluation framework and benchmark, JudgeBench, to objectively assess the performance of LLM-based judges on challenging tasks beyond simple preference alignment.
Findings
JudgeBench is more challenging than previous benchmarks.
Many strong models perform only slightly better than random.
The benchmark effectively reveals the limitations of current LLM-based judges.
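One way to read "only slightly better than random" is to ask whether a judge's accuracy on a pairwise benchmark is statistically distinguishable from the 50% chance level. The sketch below is an illustration, not the paper's procedure: it computes a Wilson score interval for an observed accuracy. The function names `wilson_interval` and `beats_random` and the counts in the usage lines are hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def beats_random(correct: int, total: int, baseline: float = 0.5) -> bool:
    """True if the whole interval lies above the random-guess baseline."""
    low, _ = wilson_interval(correct, total)
    return low > baseline


# Hypothetical counts, not results from the paper.
print(wilson_interval(50, 100))
print(beats_random(50, 100))
print(beats_random(90, 100))
```

With a small benchmark the interval is wide, so a judge scoring a few points above 50% may not be separable from guessing.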
Abstract
LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing…
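The abstract describes scoring judges on response pairs where correctness is objective rather than a matter of preference. A minimal sketch of that kind of scoring follows, assuming each pair carries a ground-truth label for the correct response. The `Pair` type, the `judge_accuracy` function, and the toy `always_a` judge are hypothetical names for illustration, not the authors' code or data format.

```python
from typing import Callable, Iterable, NamedTuple


class Pair(NamedTuple):
    question: str
    response_a: str
    response_b: str
    correct: str  # "A" or "B": which response is objectively correct


# A judge takes (question, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def judge_accuracy(pairs: Iterable[Pair], judge: Judge) -> float:
    """Fraction of pairs on which the judge picks the objectively correct response."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(
        1 for p in pairs
        if judge(p.question, p.response_a, p.response_b) == p.correct
    )
    return hits / len(pairs)


def always_a(question: str, response_a: str, response_b: str) -> str:
    """Toy judge that ignores its inputs; stands in for a real LLM call."""
    return "A"


# Toy data, made up for this example.
toy_pairs = [
    Pair("2 + 2 = ?", "4", "5", "A"),
    Pair("3 * 3 = ?", "6", "9", "B"),
    Pair("10 / 2 = ?", "5", "4", "A"),
    Pair("7 - 2 = ?", "4", "5", "B"),
]

print(judge_accuracy(toy_pairs, always_a))
```

On a label-balanced set, a judge that ignores content lands at 50%, which is the baseline the findings compare against.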
Peer Reviews
Decision: ICLR 2025 Poster
- This paper is well-written and well-motivated, with a strong emphasis on the motivation, which is crucial in the LLM-as-a-Judge domain.
- The paper proposes a new benchmark dataset for LLM-based judging, called JudgeBench, and demonstrates the effectiveness and challenging nature of their proposed dataset compared to other similar judging benchmarks, which is a significant contribution in the field of new dataset development.
- The authors conduct extensive experiments to demonstrate JudgeBenc

- There is no technical novelty; however, considering that this paper proposes a novel benchmark dataset, this is acceptable, as the paper makes a significant contribution to this field.
- The authors conduct extensive experiments; however, I have a few suggestions/questions that do not impact the overall rating score:
- The authors investigate various biases, but I am curious about length bias, a well-documented issue in this field where LLM-based judging models tend to prefer longer response

- Highlights the overemphasis on stylistic preferences in existing LLM-as-judge benchmarks, often at the expense of true task completion.
- Significant finding that many current LLM-as-judge models perform close to a random baseline on *Judge Bench*, revealing the difficulty of the prompts and the limitations of existing models.

- The benchmark includes only 360 examples, much smaller than many existing benchmarks such as Reward Bench.
- Beyond accuracy metrics, failure cases would add valuable insights into judge performance.
- Although the paper claims to apply a hierarchical approach, it appears that Principles 1 and 3 are largely overlooked in constructing *Judge Bench*.

1. This paper aims to address the pressing need for a reliable method/benchmark to evaluate LLM judges as LLMs become more advanced and task complexity continues to scale up.
2. This work pinpoints a critical gap in existing benchmark frameworks for LLM judges by concentrating on factual/logical correctness instead of just instruction following ability or human preference alignment.
3. The proposed benchmark is significantly more challenging than existing benchmarks on LLM judges.
4. The paper

1. The benchmark only has 350 response pairs, and it's unclear if this size is sufficient to obtain robust conclusions about LLM judge performance. Will increasing the size to 500 or 1000 change the performance rank of different LLM judges and reward models?
2. Table 1 shows that the proposed benchmark is quite challenging, with 10 out of 14 evaluated models/judges performing far below 50% accuracy. It is counterintuitive to me that most evaluated models/judges perform much worse than random gu
