SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Sher Badshah; Ali Emami; Hassan Sajjad

arXiv:2602.13110·cs.CL·February 20, 2026

Open Access

TL;DR

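SCOPE is a selective pairwise evaluation framework for LLM judges. It abstains on uncertain comparisons so that the error rate among accepted judgments stays at or below a user-specified level α, and it uses a new entropy-based uncertainty measure, Bidirectional Preference Entropy (BPE), to decide which judgments to accept.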

Contribution

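The paper proposes SCOPE, a method that calibrates an acceptance threshold for LLM judgments with finite-sample statistical guarantees, and introduces Bidirectional Preference Entropy (BPE), an uncertainty score that is invariant to the order in which the two responses are presented.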

Findings

01

SCOPE achieves target error rates across multiple benchmarks.

02

BPE improves uncertainty quality over standard confidence proxies.

03

SCOPE accepts significantly more judgments while maintaining risk bounds.

Abstract

Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench,…
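
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: the two positional probabilities are aggregated by a simple average, the entropy is binary entropy in bits, and the threshold rule is a simplified stand-in that accepts the largest threshold whose conservative error estimate (errors + 1) / (accepted + 1) on a labeled calibration set is at most α. The function names `bpe` and `calibrate_threshold` are hypothetical, and this simplified rule does not by itself reproduce the paper's finite-sample guarantee.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) preference: 0 at p in {0, 1}, 1 at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def bpe(p_a_first, p_a_second):
    """Order-invariant preference and its uncertainty.

    p_a_first:  judge's probability that response A wins when A is shown first.
    p_a_second: judge's probability that response A wins when A is shown second.
    Assumption: the two probabilities are aggregated by a plain average.
    """
    p = 0.5 * (p_a_first + p_a_second)
    return p, binary_entropy(p)


def calibrate_threshold(scores, correct, alpha):
    """Pick an acceptance threshold on a labeled calibration set.

    scores:  uncertainty score per calibration item (lower = more confident).
    correct: whether the judge agreed with the human label on that item.
    Returns the largest t such that accepting every item with score <= t gives
    (errors + 1) / (accepted + 1) <= alpha, or None if no threshold qualifies.
    Assumption: this conservative estimate is a simplified illustrative rule.
    """
    pairs = sorted(zip(scores, correct))
    best, errors, i, n = None, 0, 0, len(pairs)
    while i < n:
        t = pairs[i][0]
        # Consume all calibration items tied at this score before testing t.
        while i < n and pairs[i][0] == t:
            errors += 0 if pairs[i][1] else 1
            i += 1
        if (errors + 1) / (i + 1) <= alpha:
            best = t
    return best
```

At test time, a judgment would be accepted when its uncertainty score is at or below the calibrated threshold and abstained otherwise. A judge that flips its answer when the responses are swapped (for example, probabilities 0.9 and 0.1 for response A) averages to 0.5 and receives the maximum uncertainty of 1 bit, so it is among the first judgments to be rejected.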

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Machine Learning and Data Classification