The Illusion of Certainty: Uncertainty Quantification for LLMs Fails under Ambiguity

Tim Tomov; Dominik Fuchsgruber; Tom Wollschläger; Stephan Günnemann

arXiv:2511.04418 · cs.LG · January 30, 2026

TL;DR

This paper reveals that existing uncertainty quantification methods for large language models perform poorly under ambiguous language, highlighting a critical shortcoming and the need for new approaches.

Contribution

The authors introduce the first ambiguous QA datasets with ground-truth answer distributions and demonstrate the limitations of current UQ methods under ambiguity.

Findings

01

Current UQ methods degrade to near-random performance on ambiguous data.

02

Performance deterioration is consistent across different estimation paradigms.

03

Theoretical analysis explains the fundamental limitations of existing estimators under ambiguity.

Abstract

Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is critical for trustworthy deployment. While real-world language is inherently ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically benchmarked against tasks with no ambiguity. In this work, we demonstrate that while current uncertainty estimators perform well under the restrictive assumption of no ambiguity, they degrade to close-to-random performance on ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions estimated from factual co-occurrence. We find this performance deterioration to be consistent across different estimation paradigms: using the predictive distribution itself, internal representations throughout the model, and an ensemble of models. We show that this phenomenon…
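The gap the abstract describes can be illustrated with a toy calculation. The sketch below is not the paper's method: it assumes, as one common convention, that epistemic uncertainty is measured as KL(p* ‖ p) between a ground-truth answer distribution p* and the model's predictive distribution p, and it uses made-up questions and probabilities. It shows that predictive entropy alone cannot distinguish a model that is unsure about a clear question from a model that correctly reproduces an ambiguous question's answer distribution.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a dict."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)


def kl_divergence(p_star, p, eps=1e-12):
    """KL(p_star || p) over the support of p_star; eps guards against log(0)."""
    return sum(
        q * math.log(q / max(p.get(answer, 0.0), eps))
        for answer, q in p_star.items()
        if q > 0
    )


# Hypothetical unambiguous question: exactly one correct answer.
p_star_clear = {"Paris": 1.0}
p_model_clear = {"Paris": 0.5, "Lyon": 0.5}  # the model is genuinely unsure

# Hypothetical ambiguous question: two equally valid answers.
p_star_ambig = {"1998": 0.5, "2004": 0.5}
p_model_ambig = {"1998": 0.5, "2004": 0.5}  # the model matches p* exactly

# Predictive entropy is identical in both cases ...
h_clear = entropy(p_model_clear)
h_ambig = entropy(p_model_ambig)

# ... but the assumed epistemic term differs: positive in the first case, zero in the second.
eu_clear = kl_divergence(p_star_clear, p_model_clear)
eu_ambig = kl_divergence(p_star_ambig, p_model_ambig)

print(f"predictive entropy: clear={h_clear:.3f}, ambiguous={h_ambig:.3f}")
print(f"KL(p* || p):        clear={eu_clear:.3f}, ambiguous={eu_ambig:.3f}")
```

Under these assumptions, an estimator that ranks questions by predictive entropy would treat both cases as equally uncertain, even though only the first reflects a lack of knowledge.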

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 8 · Confidence 3

Strengths

I think the paper makes some important contributions to understanding when some of the current UQ methods for LLMs perform well and why, although I think the framing is still quite specific in that only a particular type of task and general setup (like using semantic classes to categorize output) is considered. In my view, the insights are useful for the field of UQ for LLMs and novel enough to warrant interest from the broader community. I’m not sure about the practical value of the new datase

Weaknesses

I find the work limiting in some ways. The authors use a definition of total uncertainty that is not widely accepted and relies on a ground-truth distribution p^*. In my view, they do not provide enough justification for this definition, choosing to merely cite a couple of other papers. So while the work is useful and conceptually interesting, not enough is mentioned about the assumptions and the cumbersome nature of the setup. This is related to the creation of the benchmark, which the authors discuss briefly

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Uncertainty quantification for LLMs is a hot topic and therefore advances in this field are definitely warranted.
- The paper sheds new light on UQ especially in the case of ambiguous QA tasks. It also motivates new research on developing UQ estimators that are sensitive to ambiguity.

Weaknesses

- The results aren't that surprising. Recent work on uncertainty quantification for the text-to-SQL task, which is inherently ambiguous (for a given natural language query there can be multiple correct SQL queries), already observed significant degradation in AUROC performance for several UQ estimators (see [1,2]).
[1] Bhattacharjya et al. SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models. EMNLP 2025.
[2] Bhattacharjya et al. Co

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The idea of assigning a ground-truth distribution to ambiguous questions is novel. Prior work primarily focused on collecting ambiguous questions and their corresponding disambiguated versions, whereas this paper explores what the ground-truth distribution should be for ambiguous cases. Also, the problem setting (examining how UQ methods behave under question ambiguity) is relevant to the reliability of LLMs.

Weaknesses

- **Questionable Motivation for the Proposed “Ground-Truth Distribution.”** The authors do not justify why the so-called ground-truth distribution for ambiguous questions is needed. From my perspective, the ambiguity is objective: it is there, and there are multiple ways to understand an ambiguous question, leading to multiple correct answers. What we need are the clarifications for the ambiguity, the disambiguations for the question, and the corresponding correct answers. It does not make sense

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)