A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth

Mingyuan Xu; Xinzi Tan; Jiawei Wu; Doudou Zhou

arXiv:2601.21817 · stat.ML · January 30, 2026

Open Access

TL;DR

This paper introduces a judge-aware ranking framework for evaluating large language models without ground truth, accounting for judge reliability to improve ranking accuracy and uncertainty estimation.

Contribution

It extends the Bradley-Terry-Luce model with judge-specific parameters, enabling joint estimation of model quality and judge reliability from pairwise comparisons.

Findings

01 Improves agreement with human preferences

02 Achieves higher data efficiency than baselines

03 Provides calibrated uncertainty quantification

Abstract

Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. More data can make evaluation more confidently wrong under misspecified aggregation. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across…
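
To make the modeling idea concrete, here is a minimal sketch of a Bradley-Terry-Luce likelihood with one discrimination parameter per judge. The abstract does not give the exact parameterization, so the form used here, P(model i beats model j | judge k) = sigmoid(gamma_k * (theta_i - theta_j)), is an assumption, as are the names `win_prob`, `neg_log_likelihood`, and `normalize`, and the particular normalization (centered scores, mean discrimination of 1). It is illustrative only and is not the authors' implementation.

```python
import numpy as np

def win_prob(theta, gamma, i, j, k):
    """Assumed form: P(i beats j | judge k) = sigmoid(gamma[k] * (theta[i] - theta[j]))."""
    return 1.0 / (1.0 + np.exp(-gamma[k] * (theta[i] - theta[j])))

def neg_log_likelihood(theta, gamma, comparisons):
    """Negative log-likelihood over (winner, loser, judge) index triples."""
    w, l, k = np.asarray(comparisons).T
    margin = gamma[k] * (theta[w] - theta[l])
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way
    return float(np.sum(np.logaddexp(0.0, -margin)))

def normalize(theta, gamma):
    """One possible normalization (hypothetical): center theta, rescale so mean(gamma) = 1.

    The likelihood depends only on gamma[k] * (theta[i] - theta[j]), so shifting
    theta and trading scale between theta and gamma leaves it unchanged.
    """
    scale = gamma.mean()
    return (theta - theta.mean()) * scale, gamma / scale

# Toy data: 3 models, 2 judges; judge 0 is assumed to be more discriminating.
theta = np.array([1.0, 0.0, -1.0])
gamma = np.array([2.0, 0.5])
comparisons = [(0, 1, 0), (0, 2, 1), (2, 1, 0)]

theta_n, gamma_n = normalize(theta, gamma)
print(neg_log_likelihood(theta, gamma, comparisons))
print(neg_log_likelihood(theta_n, gamma_n, comparisons))  # same value up to rounding
```

Under this assumed form, a judge with gamma close to 0 produces near coin-flip verdicts regardless of the quality gap, so its comparisons contribute little information about the scores. The two printed values agree because the likelihood is invariant to the shift and rescaling in `normalize`, which is why some normalization is needed before the parameters can be identified.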

Taxonomy

Topics: Computational and Text Analysis Methods · Text Readability and Simplification · Topic Modeling