Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations

Yuxi Xia; Dennis Ulmer; Terra Blevins; Yihong Liu; Hinrich Schütze; Benjamin Roth

arXiv:2601.08064·cs.CL·January 14, 2026

Open Access

TL;DR

This paper introduces an evaluation framework for confidence estimation in large language models that goes beyond calibration. It measures robustness to prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers, and it shows where existing methods fall short on these aspects.

Contribution

It proposes new metrics for confidence estimation evaluation that account for prompt and answer variations, addressing gaps in current calibration-focused assessments.

Findings

1. Common CE methods lack robustness to prompt perturbations.
2. Existing methods are insensitive to semantically different answers.
3. Many CE techniques do not maintain stability across equivalent answers.

Abstract

Confidence estimation (CE) indicates how reliable the answers of large language models (LLMs) are, and can impact user trust and decision-making. Existing work evaluates CE methods almost exclusively through calibration, which examines whether stated confidence aligns with accuracy, or discrimination, which examines whether confidence is ranked higher for correct predictions than for incorrect ones. However, these facets ignore pitfalls of CE in the context of LLMs and language variation: confidence estimates should remain consistent under semantically equivalent prompt or answer variations, and should change when the answer meaning differs. Therefore, we present a comprehensive evaluation framework for CE methods that measures confidence quality on three new aspects: robustness of confidence against prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different…
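The three aspects named in the abstract can be illustrated with a small sketch. This is not the paper's metric definition, which is not given on this page. It assumes confidences are scalars in [0, 1] and uses a mean absolute gap as a stand-in score; the function names `mean_abs_gap`, `robustness`, `stability`, and `sensitivity` are hypothetical and chosen only for illustration.

```python
from statistics import mean


def mean_abs_gap(reference, variants):
    """Mean absolute difference between a reference confidence and variant confidences."""
    return mean([abs(reference - v) for v in variants])


def robustness(conf_original_prompt, conf_perturbed_prompts):
    """Illustrative score: 1.0 means confidence is unchanged under prompt perturbations."""
    return 1.0 - mean_abs_gap(conf_original_prompt, conf_perturbed_prompts)


def stability(conf_answer, conf_equivalent_answers):
    """Illustrative score: 1.0 means semantically equivalent answers get the same confidence."""
    return 1.0 - mean_abs_gap(conf_answer, conf_equivalent_answers)


def sensitivity(conf_answer, conf_different_answers):
    """Illustrative score: larger means confidence moves more when the answer meaning changes."""
    return mean_abs_gap(conf_answer, conf_different_answers)


if __name__ == "__main__":
    # Made-up confidences for one question, for illustration only.
    print("robustness:", robustness(0.9, [0.85, 0.95, 0.7]))
    print("stability:", stability(0.8, [0.8, 0.8]))
    print("sensitivity:", sensitivity(0.8, [0.2, 0.4]))
```

Under this toy scoring, a desirable CE method would score high on all three: confidence barely moves under paraphrased prompts or equivalent answers, and moves clearly when the answer's meaning changes.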

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education