Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models
Tianyi Zhou, Johanne Medina, Sanjay Chawla

TL;DR
This paper explores how large language models can estimate the reliability of their responses by analyzing token-level uncertainties, aiming to detect confabulations and improve trustworthiness.
Contribution
The paper introduces an uncertainty-guided method for estimating the reliability of LLM responses, shows the limitations of using uncertainty signals directly, and improves detection of unreliable outputs.
Findings
Correct in-context information boosts answer accuracy and confidence.
Misleading context leads to confidently incorrect responses.
Uncertainty-guided probing improves reliability detection across models.
Abstract
Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces…
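The pipeline described in the abstract (token-level uncertainty from logits, selection of salient tokens, aggregation of their hidden states) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the softmax-entropy proxy for aleatoric uncertainty, the Dirichlet-vacuity proxy for epistemic uncertainty, the additive saliency score, and the mean pooling are all assumptions chosen for illustration.

```python
import numpy as np


def token_uncertainties(logits, top_k=10):
    """Per-token uncertainty proxies from output logits of shape [T, V].

    Assumptions (not taken from the paper):
      - aleatoric proxy: entropy of the softmax distribution
      - epistemic proxy: Dirichlet "vacuity" K / sum(alpha), where the
        top-K logits, clipped at zero, are treated as evidence
    """
    logits = np.asarray(logits, dtype=np.float64)

    # Numerically stable softmax, then entropy per token.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    aleatoric = -(probs * np.log(probs + 1e-12)).sum(axis=-1)

    # Treat the K largest logits as non-negative evidence for a Dirichlet.
    k = min(top_k, logits.shape[-1])
    top = np.sort(logits, axis=-1)[:, -k:]
    alpha = np.maximum(top, 0.0) + 1.0
    epistemic = k / alpha.sum(axis=-1)

    return aleatoric, epistemic


def aggregate_salient_states(hidden_states, logits, n_salient=3, top_k=10):
    """Mean-pool the hidden states of the most uncertain tokens.

    hidden_states: [T, D], logits: [T, V].
    Returns (pooled vector of shape [D], sorted indices of salient tokens).
    The additive score below is a hypothetical choice; the two proxies
    live on different scales and may need reweighting in practice.
    """
    hidden_states = np.asarray(hidden_states, dtype=np.float64)
    aleatoric, epistemic = token_uncertainties(logits, top_k=top_k)
    score = aleatoric + epistemic

    n = min(n_salient, len(score))
    salient = np.sort(np.argsort(score)[-n:])
    return hidden_states[salient].mean(axis=0), salient
```

Under these assumptions, the pooled vector would serve as the input feature for a small response-level classifier (for example, a logistic-regression probe) trained to predict whether an answer is reliable.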
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
