JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai

TL;DR
This paper evaluates the effectiveness of large language model-based judges in ranking generative AI systems, emphasizing the importance of system-level assessment and bias analysis for fair comparisons.
Contribution
It introduces the first large-scale study of LLM judges as system rankers, focusing on their bias, decisiveness, and alignment with human rankings.
Findings
LLM judges can produce system rankings comparable to human judgments.
Biases in LLM judges significantly affect system ranking accuracy.
System-level evaluation reveals critical factors overlooked by instance-based assessments.
Abstract
Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs,…
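The system-level setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the mean as the aggregation function, Kendall's tau as the agreement measure, the helper names (system_scores, ranking, kendall_tau), and all system names and numbers are assumptions made for illustration only.

```python
from itertools import combinations
from statistics import mean


def system_scores(judgments):
    """Average per-response judge scores into one score per system.

    Mean aggregation is an assumption; other aggregations are possible.
    """
    return {system: mean(scores) for system, scores in judgments.items()}


def ranking(scores):
    """Return system names ordered from best (highest score) to worst."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same systems.

    Counts system pairs ordered the same way (concordant) versus the
    opposite way (discordant) by the two score sets.
    """
    concordant = discordant = 0
    for x, y in combinations(sorted(scores_a), 2):
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical judge scores for three responses from each system.
judge_judgments = {
    "system_a": [0.9, 0.8, 0.7],
    "system_b": [0.6, 0.5, 0.7],
    "system_c": [0.2, 0.4, 0.3],
    "system_d": [0.5, 0.7, 0.9],
}
# Hypothetical human-derived system scores (for example, arena-style ratings).
human_scores = {
    "system_a": 1200,
    "system_b": 1100,
    "system_c": 900,
    "system_d": 1000,
}

judge_scores = system_scores(judge_judgments)
judge_ranking = ranking(judge_scores)
tau = kendall_tau(judge_scores, human_scores)

print("Judge ranking:", judge_ranking)
print("Agreement with human ranking (Kendall's tau):", round(tau, 3))
```

In this toy data the judge and the human scores disagree only on the order of system_b and system_d, so the agreement is high but not perfect. A judge with a consistent positive or negative bias towards one system would shift that system's aggregate score and could change its rank even if most individual judgments look reasonable.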
Taxonomy
Topics: Judicial and Constitutional Studies
Methods: Sparse Evolutionary Training
