Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution
Zailong Tian, Zhuoheng Han, Yanzhe Chen, Haozhe Xu, Xi Yang, Richeng Xuan, Houfeng Wang, Lizi Liao

TL;DR
This paper identifies overconfidence in large language models used as judges, introduces a new metric to measure confidence calibration, and proposes an ensemble framework to improve reliability and risk-awareness in evaluations.
Contribution
It systematically diagnoses overconfidence in LLMs, introduces TH-Score for calibration assessment, and presents LLM-as-a-Fuser, a novel ensemble method for trustworthy, confidence-driven evaluation.
Findings
Overconfidence significantly reduces LLM evaluation reliability.
TH-Score effectively measures confidence-accuracy alignment.
LLM-as-a-Fuser improves calibration and adaptive evaluation performance.
Abstract
Large Language Models (LLMs) are widely used as automated judges, where practical value depends on both accuracy and trustworthy, risk-aware judgments. Existing approaches predominantly focus on accuracy, overlooking the necessity of well-calibrated confidence, which is vital for adaptive and reliable evaluation pipelines. In this work, we advocate a shift from accuracy-centric evaluation to confidence-driven, risk-aware LLM-as-a-Judge systems, emphasizing the necessity of well-calibrated confidence for trustworthy and adaptive evaluation. We systematically identify the Overconfidence Phenomenon in current LLM-as-a-Judges, where predicted confidence significantly overstates actual correctness, undermining reliability in practical deployment. To quantify this phenomenon, we introduce TH-Score, a novel metric measuring confidence-accuracy alignment. Furthermore, we propose LLM-as-a-Fuser,…
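The overconfidence described in the abstract can be illustrated with standard calibration measures. The sketch below is not TH-Score, whose definition is not given on this page. It instead computes two common quantities: the gap between mean confidence and accuracy, and the expected calibration error (ECE) over equal-width confidence bins. The function names and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean stated confidence minus accuracy; a positive value means overconfidence."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(1 for ok in correct if ok) / len(correct)
    return mean_conf - accuracy


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Standard ECE: bin-size-weighted average of |mean confidence - accuracy| per bin."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - acc)
    return ece


# Toy data (made up): a judge that reports 0.9 confidence but is right half the time.
toy_conf = [0.9, 0.9, 0.9, 0.9]
toy_correct = [True, True, False, False]
print(overconfidence_gap(toy_conf, toy_correct))
print(expected_calibration_error(toy_conf, toy_correct))
```

A judge whose stated confidence tracks its accuracy would score near zero on both measures. A large positive gap is the pattern the paper calls the Overconfidence Phenomenon.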
Peer Reviews
Decision · Submitted to ICLR 2026
Critical Practical Issue Targeting: Systematically identifies and characterizes the "Overconfidence Phenomenon" in LLM-as-a-Judge—a long-overlooked flaw where LLMs (e.g., GPT-4o, Mistral-Nemo) exhibit confidence far exceeding actual accuracy. This addresses a key gap in existing accuracy-centric research, as uncalibrated confidence undermines risk-aware applications (e.g., auto-approving high-confidence judgments). Innovative Practical Metric (TH-Score): Proposes TH-Score to quantify confidence
Inadequate Technical Precision and Notation Clarity: The paper lacks rigor in technical details and notation. Critical assertions like "while Bayesian methods are computationally infeasible" are made without citing supporting literature, weakening their credibility. The acronym "TH" in TH-Score is never defined, creating ambiguity about the metric’s conceptual origin. The TH-Score is misleading: "accuracy" specifically refers to the accuracy of targeted confidence intervals (high/low-confidence
1. Overconfidence is an important and underexplored issue in LLM-as-a-Judge research. Addressing it has clear practical and scientific significance. 2. The paper successfully motivates why calibration matters for judge models, especially in scenarios where high-confidence predictions may replace human evaluation.
1. Lack of detail and limited novelty in the LLM-as-a-Fuser method. The main algorithmic contribution is not clearly described. From what is presented, the fusion process seems to rely on standard majority or weighted voting with an additional critique prompt, which limits its methodological originality. A more thorough description, including architecture, prompt examples, and ablation studies (with/without critiques, number of judges, different model combinations), would strengthen this part.
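For reference, the baseline the reviewer has in mind, confidence-weighted voting across several judges, can be sketched as follows. This is only an illustration of that generic baseline, not the paper's LLM-as-a-Fuser, and it omits the critique prompt entirely. The function name, the `(label, confidence)` input format, and the way the fused confidence is normalised are all assumptions.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def confidence_weighted_vote(judgments: Iterable[Tuple[str, float]]) -> Tuple[str, float]:
    """Fuse (label, confidence) pairs from several judges.

    Each judge's vote is weighted by its stated confidence. The returned
    confidence is the winning label's share of the total weight (an assumed
    normalisation, not taken from the paper).
    """
    totals = defaultdict(float)
    for label, confidence in judgments:
        totals[label] += confidence
    if not totals:
        raise ValueError("at least one judgment is required")
    # On ties, max() keeps the label that was seen first.
    winner = max(totals, key=totals.get)
    total_weight = sum(totals.values())
    fused_confidence = totals[winner] / total_weight if total_weight > 0 else 0.0
    return winner, fused_confidence


# Hypothetical judgments: one very confident judge outweighs two hesitant ones,
# so the weighted result differs from a plain majority vote.
label, conf = confidence_weighted_vote([("A", 0.95), ("B", 0.4), ("B", 0.4)])
print(label, conf)
```

Because the weights are the judges' own stated confidences, a scheme like this inherits any overconfidence in the individual judges. This is one reason the ablations the reviewer requests (with and without critiques, number of judges, model combinations) would be informative.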
* It clearly defines and systematically analyzes the "Overconfidence Phenomenon" in the scenario of Large Language Models as judges (LLM-as-a-Judge) — where the model’s predicted confidence significantly exceeds its actual correctness. * The experimental design is relatively rigorous, covering 14 mainstream LLMs (including open-source and closed-source models) and adopting three confidence calculation methods
* TH-Score Lacks Generalization and Mechanistic Validation: Only evaluated on JudgeBench, but LLM-as-a-Judge benchmarks (MTBench, FairEval, LLMBar) vary in task types (pairwise vs. single-sample) and evaluation criteria (subjective style vs. objective logic). * Unjustified hyperparameter ε: Selects ε=0.1 as optimal but provides no analysis of how ε performs across different task subdomains of JudgeBench (e.g., math vs. coding). For example, math tasks may require stricter high-confidence thresho
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI
