JuStRank: Benchmarking LLM Judges for System Ranking

Ariel Gera; Odellia Boni; Yotam Perlitz; Roy Bar-Haim; Lilach Eden; Asaf Yehudai

arXiv:2412.09569 · cs.CL · June 11, 2025

TL;DR

This paper evaluates the effectiveness of large language model-based judges in ranking generative AI systems, emphasizing the importance of system-level assessment and bias analysis for fair comparisons.

Contribution

It introduces the first large-scale study of LLM judges as system rankers, focusing on their bias, decisiveness, and alignment with human rankings.

Findings

01. LLM judges can produce system rankings comparable to human judgments.

02. Biases in LLM judges significantly affect system ranking accuracy.

03. System-level evaluation reveals critical factors overlooked by instance-based assessments.

Abstract

Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs,…
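The system-level setup described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the choice of the mean as the aggregation, the use of Kendall's tau as the agreement measure, the function names, and all data values are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def system_scores(judgments):
    """Aggregate per-response judge scores into one score per system.

    Assumption: mean aggregation; other aggregations are possible.
    """
    by_system = defaultdict(list)
    for system, score in judgments:
        by_system[system].append(score)
    return {system: mean(scores) for system, scores in by_system.items()}


def ranking(scores):
    """Return system names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same systems.

    Tied pairs count as neither concordant nor discordant.
    """
    systems = sorted(scores_a)
    concordant = discordant = 0
    for x, y in combinations(systems, 2):
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical judge scores: (system name, score for one response).
judgments = [
    ("system_a", 8.0), ("system_a", 6.0),
    ("system_b", 9.0), ("system_b", 9.0),
    ("system_c", 3.0), ("system_c", 5.0),
]
# Hypothetical human reference scores for the same systems.
human_scores = {"system_a": 2.0, "system_b": 3.0, "system_c": 1.0}

judge_scores = system_scores(judgments)
judge_ranking = ranking(judge_scores)
tau = kendall_tau(judge_scores, human_scores)
```

In this toy example the judge and the human reference order the three systems identically, so the agreement is perfect. A judge that is consistently biased for or against one system would shift that system's aggregate score and could reorder the ranking even if its per-response judgments look reasonable in isolation.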

Code & Models

Datasets

ibm-research/justrank_judge_scores
dataset · 4 downloads

Videos

JuStRank: Benchmarking LLM Judges for System Ranking· underline

Taxonomy

Topics: Judicial and Constitutional Studies

Methods: Sparse Evolutionary Training