Who can we trust? LLM-as-a-jury for Comparative Assessment

Mengjie Qian; Guangzhi Sun; Mark J.F. Gales; Kate M. Knill

arXiv:2602.16610 · cs.CL · February 19, 2026

Open Access

TL;DR

This paper introduces BT-sigma, a judge-aware model for aggregating LLM-based pairwise comparisons, which accounts for judge reliability and improves ranking accuracy without requiring supervised calibration.

Contribution

It proposes BT-sigma, an extension of the Bradley-Terry model that jointly infers item rankings and judge reliability from pairwise comparisons, enhancing LLM evaluation methods.
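As a rough illustration of what a judge-aware Bradley-Terry fit can look like, the sketch below assumes the parameterisation P(i beats j | judge k) = sigmoid(a_k (s_i - s_j)), where s holds the item scores and a_k is a per-judge discriminator. This form, the function name `fit_judge_aware_bt`, the gradient-ascent optimiser, and the synthetic data are all assumptions made for illustration; the paper's exact BT-sigma formulation and fitting procedure may differ.

```python
import numpy as np

def fit_judge_aware_bt(judge, first, second, pref, n_items, n_judges,
                       steps=5000, lr=0.3, l2=1e-3):
    """Fit item scores s and judge discriminators a by gradient ascent.

    Assumed model (hypothetical): P(first beats second | judge k)
    = sigmoid(a_k * (s_first - s_second)).
    pref holds each judge's preference for `first`, as a value in [0, 1].
    """
    s = np.zeros(n_items)   # item scores
    b = np.zeros(n_judges)  # log-discriminators, so a = exp(b) stays positive
    n = len(pref)
    for _ in range(steps):
        a = np.exp(b[judge])
        d = s[first] - s[second]
        p = 1.0 / (1.0 + np.exp(-a * d))
        r = pref - p  # derivative of the log-likelihood w.r.t. the logit a * d
        g_s = np.zeros(n_items)
        np.add.at(g_s, first, a * r)
        np.add.at(g_s, second, -a * r)
        g_b = np.zeros(n_judges)
        np.add.at(g_b, judge, a * d * r)
        # Ascent step on the mean log-likelihood with a small L2 penalty.
        s += lr * (g_s / n - l2 * s)
        b += lr * (g_b / n - l2 * b)
        # Remove the scale ambiguity between s and a: force mean(log a) = 0
        # and rescale s so that every product a_k * (s_i - s_j) is unchanged.
        m = b.mean()
        b -= m
        s *= np.exp(m)
        s -= s.mean()
    return s, np.exp(b)

# Synthetic demo with made-up data: judge 0 is sharp, judge 1 is near random.
rng = np.random.default_rng(0)
true_s = np.linspace(-2.0, 2.0, 6)
true_a = np.array([3.0, 0.2])
rows = [(k, i, j) for k in range(2)
        for i in range(6) for j in range(i + 1, 6) for _ in range(40)]
judge, first, second = (np.array(c) for c in zip(*rows))
p_true = 1.0 / (1.0 + np.exp(-true_a[judge] * (true_s[first] - true_s[second])))
pref = (rng.random(len(rows)) < p_true).astype(float)

s_hat, a_hat = fit_judge_aware_bt(judge, first, second, pref,
                                  n_items=6, n_judges=2)
print("ranking (best first):", np.argsort(-s_hat))
print("estimated discriminators:", a_hat)
```

Under this assumed model, a judge whose comparisons agree with the consensus ordering receives a larger a_k and therefore more weight in the ranking, while a near-random judge is pushed towards a small a_k. No human labels are involved in the fit.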

Findings

01

BT-sigma outperforms averaging-based methods on benchmark datasets.

02

The learned discriminator for each judge correlates with the cycle consistency of that judge's judgments.

03

BT-sigma acts as an unsupervised judge calibration mechanism.
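To make the cycle-consistency notion in finding 02 concrete, here is one simple way such a measure could be computed for a single judge: the fraction of item triples whose hard preferences contain no 3-cycle (A beats B, B beats C, C beats A). The function name `cycle_consistency`, the symmetrisation step, and the 0.5 threshold are illustrative assumptions, not necessarily the measure used in the paper.

```python
from itertools import combinations

import numpy as np

def cycle_consistency(P):
    """Fraction of item triples with transitive (acyclic) hard preferences.

    P[i, j] is a judge's probability that item i beats item j. Because LLM
    probabilities need not satisfy P[i, j] + P[j, i] = 1, the matrix is
    symmetrised first (an assumption of this sketch).
    """
    Q = 0.5 * (P + 1.0 - P.T)
    wins = Q > 0.5  # wins[i, j] is True when i is preferred over j
    total = 0
    cyclic = 0
    for i, j, k in combinations(range(P.shape[0]), 3):
        total += 1
        forward = wins[i, j] and wins[j, k] and wins[k, i]
        backward = wins[j, i] and wins[k, j] and wins[i, k]
        if forward or backward:
            cyclic += 1
    return 1.0 - cyclic / total

# Made-up example matrices: one fully transitive judge, one fully cyclic judge.
transitive = np.array([[0.5, 0.9, 0.9],
                       [0.1, 0.5, 0.9],
                       [0.1, 0.1, 0.5]])
cyclic_judge = np.array([[0.5, 0.9, 0.1],
                         [0.1, 0.5, 0.9],
                         [0.9, 0.1, 0.5]])
print(cycle_consistency(transitive), cycle_consistency(cyclic_judge))
```

A score like this depends only on the judge's own comparisons, which is why it can serve as a check on a learned reliability parameter that is independent of the model fit.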

Abstract

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgments. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that they limit the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge…

Taxonomy

Topics: Topic Modeling · Psychometric Methodologies and Testing · Explainable Artificial Intelligence (XAI)