Heterogeneous Judge-Aware Ranking with Sensitivity, Disagreement, and Confidence
Shibo Yu, Yingzhou Wang, Yan Chen, Guodong Li, Jin-Hong Du

TL;DR
This paper introduces Heterogeneous Judge-Aware (HJA) ranking, a framework for multi-judge pairwise comparisons that models consensus ranking, judge sensitivity, and disagreement as separate components, with the aim of improving ranking accuracy and interpretability.
Contribution
The paper develops a structured multi-judge ranking method whose decomposition is identifiable under stated conditions, together with an anchored alternating algorithm for fitting it and uncertainty quantification for heterogeneous judgments.
Findings
HJA improves recovery and robustness over pooled baselines.
It provides uncertainty calibration for consensus and judge-specific contrasts.
The model offers diagnostics for judge disagreement and model patterns.
Abstract
Pairwise comparisons from multiple judges are central to large language model evaluation and preference modeling, yet standard ranking pipelines often pool judgments into a single score vector, treating systematic judge disagreement as noise. We propose Heterogeneous Judge-Aware (HJA) ranking, a structured multi-judge ranking framework that separates consensus ranking, judge-specific sensitivity to consensus, and residual preference disagreement. HJA thereby treats ranking, judge sensitivity, and structured disagreement as separate inferential targets. We establish conditions under which this decomposition is identifiable and develop an anchored alternating algorithm that preserves the identifying geometry. For confidence quantification, we study a fixed-panel repeated-comparison regime in which the judge panel may remain fixed or modest while information grows through repeated…
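The three-way split described in the abstract can be illustrated with a small sketch. The truncated abstract does not state the model's functional form, so everything below is an assumption made for illustration: a Bradley-Terry-style logistic link, a scalar sensitivity `alpha_k` per judge, and a per-judge residual vector `delta_k`. The names `win_probability`, `theta`, `alpha_k`, and `delta_k` are hypothetical and are not taken from the paper.

```python
import math

def win_probability(theta, alpha_k, delta_k, i, j):
    """P(judge k prefers item i over item j) under an ASSUMED logistic form.

    theta   : consensus scores shared by all judges (assumed notation)
    alpha_k : judge k's sensitivity to the consensus gap (assumed scalar)
    delta_k : judge k's residual preferences, one value per item (assumed)
    """
    logit = alpha_k * (theta[i] - theta[j]) + (delta_k[i] - delta_k[j])
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical toy values, not results from the paper.
theta = [1.0, 0.0, -1.0]        # consensus scores, centered to sum to zero
no_residual = [0.0, 0.0, 0.0]   # a judge with no residual disagreement

# Same consensus, different sensitivity: the sharper judge is more decisive.
sharp_judge = win_probability(theta, 2.0, no_residual, 0, 1)
flat_judge = win_probability(theta, 0.5, no_residual, 0, 1)

# A residual term can reverse a judge's preference on a specific pair.
contrarian = win_probability(theta, 1.0, [-1.5, 1.5, 0.0], 0, 1)

print(f"sharp={sharp_judge:.3f} flat={flat_judge:.3f} contrarian={contrarian:.3f}")
```

Under these assumed values, pooling the three judges into one score vector would blur a decisive judge, an indifferent judge, and a systematically dissenting judge into a single averaged gap, which is the behavior the abstract says standard pipelines treat as noise. Centering `theta` here is only a convenience for the toy example; the paper's actual identifiability conditions are not given in the excerpt.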
