SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Sher Badshah, Ali Emami, Hassan Sajjad

TL;DR
SCOPE introduces a selective pairwise evaluation framework for LLM judges, using a novel entropy-based uncertainty measure to ensure calibrated error rates and high coverage across benchmarks.
Contribution
The paper proposes SCOPE, a selective judging method that calibrates LLM judgments with finite-sample statistical guarantees, and introduces Bidirectional Preference Entropy (BPE) for bias-neutral uncertainty estimation.
Findings
SCOPE achieves target error rates across multiple benchmarks.
BPE improves uncertainty quality over standard confidence proxies.
SCOPE accepts significantly more judgments while maintaining risk bounds.
Abstract
Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level α. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench,…
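The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made for illustration: the two order-swapped probabilities are aggregated with a simple mean, uncertainty is the binary entropy in bits, and the threshold rule is a conservative plug-in check with a +1 correction that stands in for the paper's calibration procedure and does not by itself reproduce its finite-sample guarantee. The names `bpe` and `calibrate_threshold` are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) preference; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def bpe(p_a_when_a_first: float, p_a_when_a_second: float):
    """Return (prediction, uncertainty) from two order-swapped judge queries.

    Both arguments are the judge's implied probability that response A is
    preferred, once with A shown first and once with A shown second.
    Assumption: the two probabilities are aggregated by a simple mean.
    """
    p_a = 0.5 * (p_a_when_a_first + p_a_when_a_second)
    prediction = "A" if p_a >= 0.5 else "B"
    return prediction, binary_entropy(p_a)


def calibrate_threshold(uncertainties, correct, alpha):
    """Pick the largest uncertainty threshold that passes a corrected check.

    Assumption: a judgment is accepted when its uncertainty is <= threshold,
    and a threshold passes when (errors + 1) / (accepted + 1) <= alpha on the
    calibration set. Returns None when no threshold passes.
    """
    pairs = sorted(zip(uncertainties, correct))
    best = None
    errors = 0
    for k, (u, ok) in enumerate(pairs, start=1):
        errors += 0 if ok else 1
        # Evaluate only at the last item of a group of tied uncertainties.
        if k < len(pairs) and pairs[k][0] == u:
            continue
        if (errors + 1) / (k + 1) <= alpha:
            best = u
    return best


if __name__ == "__main__":
    # A judge that flips its answer when the order is swapped is maximally uncertain.
    print(bpe(0.9, 0.1))
    # A judge that is consistent across both orders is less uncertain.
    print(bpe(0.9, 0.7))
```

At test time, a judgment would be accepted only if its BPE score is at or below the calibrated threshold; otherwise the judge abstains. Averaging over both orders means a judge that simply favors the first position ends up near probability 0.5 and therefore near maximum entropy, which is why such pairs tend to be rejected.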
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Machine Learning and Data Classification
