A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth
Mingyuan Xu, Xinzi Tan, Jiawei Wu, Doudou Zhou

TL;DR
This paper introduces a judge-aware ranking framework for evaluating large language models without ground truth, accounting for judge reliability to improve ranking accuracy and uncertainty estimation.
Contribution
The paper extends the Bradley-Terry-Luce model with judge-specific discrimination parameters, so that model quality and judge reliability are estimated jointly from pairwise comparisons.
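To make the idea concrete, here is a minimal sketch of one way a judge-aware Bradley-Terry-Luce likelihood could look. The summary does not spell out the parameterization, so everything specific here is an assumption: the form P(judge k prefers model i over j) = sigmoid(gamma_k * (theta_i - theta_j)), the normalizations (scores centered at zero, judge parameters with geometric mean one), and the helper names `simulate_comparisons`, `log_likelihood`, and `fit`. Plain gradient ascent stands in for whatever estimator the authors actually use.

```python
import numpy as np

def simulate_comparisons(theta, gamma, n, rng):
    """Draw n synthetic comparisons as rows (judge, model_i, model_j, y)."""
    k = rng.integers(0, len(gamma), size=n)
    i = rng.integers(0, len(theta), size=n)
    # Offset in 1..n_models-1 guarantees j != i.
    j = (i + rng.integers(1, len(theta), size=n)) % len(theta)
    p = 1.0 / (1.0 + np.exp(-gamma[k] * (theta[i] - theta[j])))
    y = (rng.random(n) < p).astype(int)  # y = 1 means judge k preferred model i
    return np.stack([k, i, j, y], axis=1)

def log_likelihood(theta, gamma, data):
    """Log-likelihood under the assumed model sigmoid(gamma_k * (theta_i - theta_j))."""
    k, i, j, y = data.T
    z = gamma[k] * (theta[i] - theta[j])
    signed = np.where(y == 1, z, -z)
    # log(sigmoid(s)) = -log(1 + exp(-s)), computed stably with logaddexp.
    return -np.sum(np.logaddexp(0.0, -signed))

def fit(data, n_models, n_judges, lr=0.2, steps=3000):
    """Gradient ascent on the mean log-likelihood (an illustrative solver only)."""
    k, i, j, y = data.T
    theta = np.zeros(n_models)
    log_gamma = np.zeros(n_judges)  # gamma = exp(log_gamma) stays positive
    for _ in range(steps):
        gamma = np.exp(log_gamma)
        diff = theta[i] - theta[j]
        resid = y - 1.0 / (1.0 + np.exp(-gamma[k] * diff))
        w = resid * gamma[k]
        g_theta = (np.bincount(i, weights=w, minlength=n_models)
                   - np.bincount(j, weights=w, minlength=n_models))
        g_log_gamma = np.bincount(k, weights=w * diff, minlength=n_judges)
        theta += lr * g_theta / len(data)
        log_gamma += lr * g_log_gamma / len(data)
    # Assumed normalization: center theta, set the geometric mean of gamma to 1.
    # Rescaling theta by exp(shift) leaves every gamma_k * (theta_i - theta_j) unchanged.
    theta -= theta.mean()
    shift = log_gamma.mean()
    return theta * np.exp(shift), np.exp(log_gamma - shift)
```

On synthetic data, one would call `simulate_comparisons` with known scores and judge parameters, pass the result to `fit`, and check that the estimated `gamma` is larger for the judges that were simulated as more discriminating. A judge with `gamma` near zero answers close to a coin flip under this form, which is why treating all judges equally can distort the ranking.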
Findings
Improves agreement with human preferences
Achieves higher data efficiency than baselines
Provides calibrated uncertainty quantification
Abstract
Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. More data can make evaluation more confidently wrong under misspecified aggregation. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across…
Taxonomy
Topics: Computational and Text Analysis Methods · Text Readability and Simplification · Topic Modeling
