VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference

Jasmine Qi; Danylo Dantsev; Muyang Sun

arXiv:2605.11334·cs.LG·May 13, 2026

TL;DR

VERDI estimates how much a verification-based LLM judge's verdict can be trusted by analyzing the reasoning trace the judge already produces, so it needs no extra inference calls. It is evaluated on three public benchmarks and several judge models.

Contribution

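The paper introduces a decomposition-based confidence estimator. It splits each verification-style evaluation into sub-checks and derives three structural signals (Step-Verdict Alignment, Claim-Level Margin, and Evidence Grounding Score) from the judge's existing reasoning trace, at no additional inference cost.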

Findings

01

On three public benchmarks, VERDI achieves AUROC 0.72-0.91 with GPT-4.1-mini as the judge and 0.66-0.80 with GPT-5.4-mini.

02

It outperforms log-probability-based confidence signals, especially where they saturate.

03

The method generalizes across models and can be scaled with a small NLI model.

Abstract

LLM-as-Judge systems are widely deployed for automated evaluation, yet practitioners lack reliable methods to know when a judge's verdict should be trusted. Token log-probabilities, the standard post-hoc confidence signal, are unavailable for many commercial LLMs and, even when accessible, saturate above 0.999 with structured JSON output. We introduce VERDI (VERification-Decomposed Inference), a method that extracts confidence from the reasoning trace a structured judge already produces, with no additional inference calls. VERDI decomposes each verification-style evaluation into sub-checks and derives three structural signals: Step-Verdict Alignment, Claim-Level Margin, and Evidence Grounding Score. We combine them with Platt-scaled logistic regression. On three public benchmarks, VERDI achieves AUROC 0.72-0.91 on GPT-4.1-mini and 0.66-0.80 on GPT-5.4-mini. On Qwen3.5-4B/9B/27B,…
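The combination step described in the abstract can be illustrated with a minimal sketch. It assumes each judged item has already been reduced to three numeric signals in [0, 1] and a binary label saying whether the judge's verdict was correct. The signal values, the function names (`fit_logistic`, `confidence`, `auroc`), and the training settings are all hypothetical and are not taken from the paper. The sketch fits a plain logistic regression and scores it with AUROC. A separate Platt calibration step, which is a one-dimensional logistic fit applied to a raw score on held-out data, is not shown.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on log loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def confidence(signals, w, b):
    """Map one item's three signals to a confidence in (0, 1)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, signals)) + b)

def auroc(scores, labels):
    """Chance that a correct verdict outranks an incorrect one (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Synthetic toy data (NOT from the paper). Columns are assumed to be
# [step-verdict alignment, claim-level margin, evidence grounding score].
X = [
    [0.90, 0.80, 0.90], [0.80, 0.70, 0.80], [0.95, 0.60, 0.70], [0.70, 0.90, 0.85],
    [0.30, 0.20, 0.40], [0.40, 0.10, 0.30], [0.20, 0.30, 0.20], [0.50, 0.20, 0.10],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = the judge's verdict was correct

w, b = fit_logistic(X, y)
scores = [confidence(x, w, b) for x in X]
print("weights:", w, "bias:", b)
print("training AUROC:", auroc(scores, y))
```

Training-set AUROC on hand-made, cleanly separated data says nothing about real performance. In practice the fit and the calibration would use held-out judgments, and a library implementation of logistic regression would replace the hand-rolled optimizer.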
