Who can we trust? LLM-as-a-jury for Comparative Assessment
Mengjie Qian, Guangzhi Sun, Mark J.F. Gales, Kate M. Knill

TL;DR
This paper introduces BT-sigma, a judge-aware model for aggregating LLM-based pairwise comparisons, which accounts for judge reliability and improves ranking accuracy without requiring supervised calibration.
Contribution
It proposes BT-sigma, an extension of the Bradley-Terry model that jointly infers item rankings and judge reliability from pairwise comparisons, enhancing LLM evaluation methods.
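As a rough illustration of the idea, the sketch below fits a Bradley-Terry model with one extra parameter per judge. It is not the authors' implementation: the parameterisation P(i beats j | judge k) = sigmoid(a_k * (s_i - s_j)), the function name fit_judge_aware_bt, the gradient-ascent fitting, and the normalisation (mean score 0, mean log-discriminator 0) are all assumptions made for this example. The exact BT-sigma formulation should be taken from the paper itself.

```python
import numpy as np

def fit_judge_aware_bt(comparisons, n_items, n_judges, lr=0.5, steps=3000):
    """Hypothetical judge-aware Bradley-Terry fit (illustrative only).

    comparisons: rows of (judge, i, j, p), where p is the judge's probability
    (or a 0/1 outcome) that item i beats item j.
    Assumed model: P(i beats j | judge k) = sigmoid(a_k * (s_i - s_j)).
    Returns (item scores s, judge discriminators a).
    """
    data = np.asarray(comparisons, dtype=float)
    k = data[:, 0].astype(int)
    i = data[:, 1].astype(int)
    j = data[:, 2].astype(int)
    p = data[:, 3]

    s = np.zeros(n_items)       # item scores
    log_a = np.zeros(n_judges)  # log of each judge's discriminator (keeps a > 0)

    for _ in range(steps):
        a = np.exp(log_a)
        d = s[i] - s[j]
        q = 1.0 / (1.0 + np.exp(-a[k] * d))  # model probability that i beats j
        err = p - q                          # gradient of log-likelihood w.r.t. the logit

        grad_s = np.zeros(n_items)
        np.add.at(grad_s, i, err * a[k])
        np.add.at(grad_s, j, -err * a[k])
        grad_log_a = np.zeros(n_judges)
        np.add.at(grad_log_a, k, err * d * a[k])

        # Gradient ascent on the mean log-likelihood.
        s += lr * grad_s / len(p)
        log_a += lr * grad_log_a / len(p)

        # Remove the shift and scale ambiguities without changing any logit.
        s -= s.mean()
        shift = log_a.mean()
        log_a -= shift
        s *= np.exp(shift)

    return s, np.exp(log_a)

# Toy usage: two judges, one sharp (a = 2.0) and one weak (a = 0.5),
# both reporting noise-free probabilities under the assumed model.
true_s = np.array([1.0, 0.5, 0.0, -0.5, -1.0])
true_a = np.array([2.0, 0.5])
rows = []
for judge in range(2):
    for x in range(5):
        for y in range(x + 1, 5):
            prob = 1.0 / (1.0 + np.exp(-true_a[judge] * (true_s[x] - true_s[y])))
            rows.append((judge, x, y, prob))

scores, discriminators = fit_judge_aware_bt(rows, n_items=5, n_judges=2)
print(np.argsort(-scores), discriminators)
```

In this toy setup, the judge whose probabilities separate items more sharply should end up with the larger discriminator, so its comparisons weigh more heavily in the inferred ranking than they would under plain averaging.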
Findings
BT-sigma outperforms averaging-based methods on benchmark datasets.
The learned per-judge discriminator correlates with the cycle consistency of that judge's judgments.
BT-sigma acts as an unsupervised judge calibration mechanism.
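One simple way to make "cycle consistency" concrete is to count how often a judge's pairwise preferences over three items form a cycle (A beats B, B beats C, C beats A). The sketch below does this; the function name cycle_consistency and the 0.5 threshold for turning probabilities into preferences are assumptions for illustration, and the paper's own measure may be defined differently.

```python
from itertools import combinations

def cycle_consistency(prob):
    """Fraction of item triples whose hard preferences contain no cycle.

    prob[i][j] is one judge's probability that item i beats item j (i != j).
    A preference is taken to hold when the probability exceeds 0.5
    (an assumption made for this sketch).
    """
    n = len(prob)

    def beats(x, y):
        return prob[x][y] > 0.5

    total = 0
    consistent = 0
    for i, j, k in combinations(range(n), 3):
        total += 1
        forward_cycle = beats(i, j) and beats(j, k) and beats(k, i)
        backward_cycle = beats(j, i) and beats(k, j) and beats(i, k)
        if not (forward_cycle or backward_cycle):
            consistent += 1
    return consistent / total if total else 1.0

# A transitive judge (0 > 1 > 2) and a cyclic judge (0 > 1, 1 > 2, 2 > 0).
transitive = [[0.5, 0.9, 0.8], [0.1, 0.5, 0.7], [0.2, 0.3, 0.5]]
cyclic = [[0.5, 0.9, 0.2], [0.1, 0.5, 0.7], [0.8, 0.3, 0.5]]
print(cycle_consistency(transitive), cycle_consistency(cyclic))
```

Computed per judge, a score like this needs no human labels, which is what makes it usable as an independent check on a learned reliability parameter.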
Abstract
Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgments. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that they limit the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge…
Taxonomy
Topics: Topic Modeling · Psychometric Methodologies and Testing · Explainable Artificial Intelligence (XAI)
