VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference
Jasmine Qi, Danylo Dantsev, Muyang Sun

TL;DR
VERDI estimates confidence in verification-based LLM judgments from the reasoning trace the judge already produces, with no extra inference calls, and is evaluated across multiple models and three public benchmarks.
Contribution
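It introduces a decomposition-based confidence estimator that splits each verification-style evaluation into sub-checks and derives three structural signals from the existing reasoning trace (Step-Verdict Alignment, Claim-Level Margin, and Evidence Grounding Score), with no additional inference overhead.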
Findings
VERDI achieves AUROC 0.72-0.91 on GPT-4.1-mini and 0.66-0.80 on GPT-5.4-mini across three public benchmarks.
It outperforms log-probability-based confidence signals, especially where they saturate.
The method generalizes across models and can be scaled with a small NLI model.
Abstract
LLM-as-Judge systems are widely deployed for automated evaluation, yet practitioners lack reliable methods to know when a judge's verdict should be trusted. Token log-probabilities, the standard post-hoc confidence signal, are unavailable for many commercial LLMs and, even when accessible, saturate above 0.999 with structured JSON output. We introduce VERDI (VERification-Decomposed Inference), a method that extracts confidence from the reasoning trace a structured judge already produces, with no additional inference calls. VERDI decomposes each verification-style evaluation into sub-checks and derives three structural signals: Step-Verdict Alignment, Claim-Level Margin, and Evidence Grounding Score. We combine them with Platt-scaled logistic regression. On three public benchmarks, VERDI achieves AUROC 0.72-0.91 on GPT-4.1-mini and 0.66-0.80 on GPT-5.4-mini. On Qwen3.5-4B/9B/27B,…
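The combination step described in the abstract can be illustrated with a minimal sketch. It assumes each judged item already has three signal values in [0, 1] and a binary label saying whether the judge's verdict was correct. The signal values here are synthetic, the function names (`fit_logistic`, `confidence`, `auroc`) are hypothetical, and the model is a plain logistic regression trained by gradient descent, which has the same sigmoid form that Platt scaling uses. How the paper computes the signals or applies Platt scaling is not specified in this summary, so this is not the authors' implementation.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def confidence(w, b, signals):
    """Map one item's three signals to a confidence in (0, 1)."""
    return sigmoid(sum(wj * sj for wj, sj in zip(w, signals)) + b)

def auroc(scores, labels):
    """Probability that a correct verdict outranks an incorrect one (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Synthetic data (an assumption for illustration only): correct verdicts
# (label 1) tend to have higher signal values than incorrect ones (label 0).
random.seed(0)
X, y = [], []
for i in range(120):
    label = i % 2
    mean = 0.7 if label else 0.4
    # Columns stand in for the three signals; the values are made up.
    X.append([min(1.0, max(0.0, random.gauss(mean, 0.15))) for _ in range(3)])
    y.append(label)

w, b = fit_logistic(X, y)
scores = [confidence(w, b, x) for x in X]
print("weights:", [round(v, 2) for v in w], "bias:", round(b, 2))
print("in-sample AUROC:", round(auroc(scores, y), 3))
```

The AUROC printed here is computed on the training data, so it only checks that the pieces fit together; a real evaluation would score held-out items.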
