Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Manan Gupta; Dhruv Kumar

arXiv:2604.15302·cs.AI·April 17, 2026

TL;DR

This paper introduces diagnostic tools for assessing the reliability of LLM-based judges in NLG evaluation, revealing inconsistencies and proposing conformal prediction sets for per-instance reliability measurement.

Contribution

It presents a transitivity analysis and conformal prediction methodology to evaluate and improve the reliability of LLM judges, with theoretical guarantees and cross-judge agreement evidence.

Findings

01

Widespread per-input inconsistency masked by low aggregate violation rates

02

Prediction set width correlates with document difficulty and shows cross-judge agreement

03

Relevance is judged most reliably, while fluency and consistency are less reliable
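The gap between aggregate and per-document inconsistency in finding 01 can be illustrated with a minimal sketch. It assumes the judge's pairwise preferences for one document are stored in a boolean matrix `beats` (a hypothetical name), and it assumes the violation rate is the fraction of item triples that form a directed 3-cycle; the paper's exact definition may differ.

```python
from itertools import combinations

def three_cycle_stats(n_items, beats):
    """Count directed 3-cycles among all item triples.

    beats[i][j] is True when the judge prefers item i over item j
    (hypothetical input format, assumed for this sketch).
    Returns (number of cyclic triples, total number of triples).
    """
    triples = list(combinations(range(n_items), 3))
    cycles = 0
    for a, b, c in triples:
        # A triple is cyclic if a > b > c > a or a > c > b > a.
        forward = beats[a][b] and beats[b][c] and beats[c][a]
        backward = beats[a][c] and beats[c][b] and beats[b][a]
        if forward or backward:
            cycles += 1
    return cycles, len(triples)

# Toy document with 8 candidate summaries: a perfectly transitive
# ranking (lower index wins) with a single flipped preference.
n = 8
beats = [[i < j for j in range(n)] for i in range(n)]
beats[0][2], beats[2][0] = False, True  # item 2 now beats item 0

cycles, total = three_cycle_stats(n, beats)
rate = cycles / total
has_cycle = cycles > 0
```

In this toy case a single flipped preference produces one cyclic triple out of 56, so the document counts as having a cycle even though its violation rate is under 2%. That is the pattern the finding describes: a low aggregate rate can coexist with many affected documents.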

Abstract

LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: (1) a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and (2) split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq (1 - \alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_{s} = +0.576$, $N = 1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise.…
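As a rough illustration of the split conformal step, the sketch below builds prediction sets over the 1-5 Likert scale. It assumes an absolute-residual nonconformity score between the judge's score and the human label, which is a common choice but not necessarily the one used in the paper. The function names and the toy calibration data are hypothetical.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Finite-sample quantile used by split conformal prediction:
    the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the set must
        # contain every label to keep the coverage guarantee.
        return float("inf")
    return sorted(cal_scores)[k - 1]

def prediction_set(judge_score, qhat, labels=(1, 2, 3, 4, 5)):
    """All Likert labels whose nonconformity score is within qhat."""
    return [y for y in labels if abs(y - judge_score) <= qhat]

# Toy calibration split: judge scores paired with human labels.
cal_judge = [3, 4, 2, 5, 3, 1, 4, 2, 5]
cal_human = [3, 3, 2, 3, 4, 1, 5, 2, 4]
cal_scores = [abs(h - j) for j, h in zip(cal_judge, cal_human)]

alpha = 0.25  # target coverage of at least 1 - alpha = 75%
qhat = conformal_quantile(cal_scores, alpha)

set_for_4 = prediction_set(4, qhat)
set_for_1 = prediction_set(1, qhat)
width_for_4 = len(set_for_4)
```

With this particular score the threshold `qhat` is shared by all test items, so set width varies only through clipping at the ends of the scale. For width to act as a per-instance reliability signal, as the abstract reports, the nonconformity score would need to depend on the instance, for example through a judge's predicted distribution over the five labels.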
