SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?

Michael Kirchhof; Luca Füger; Adam Goliński; Eeshan Gunesh Dhekane; Arno Blaas; Seong Joon Oh; Sinead Williamson

arXiv:2505.20295 · cs.CL · February 6, 2026

TL;DR

This paper introduces SelfReflect, a metric for evaluating whether LLMs can faithfully communicate their internal answer distributions, and uses it to show how limited current models are at doing so.

Contribution

The paper develops the SelfReflect metric to measure LLMs' ability to faithfully summarize their internal uncertainty distributions.

Findings

01. LLMs generally cannot accurately reflect their internal uncertainties.

02. Sampling and refeeding outputs improves LLMs' ability to generate faithful uncertainty summaries.

03. SelfReflect provides a fine-grained measure of LLMs' transparency about their internal beliefs.
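
The sample-and-refeed idea in finding 02 can be illustrated with a minimal sketch. It assumes the answers have already been sampled from a model (the `samples` list below is made-up illustration data, not a result from the paper), groups them by exact match after lowercasing, and formats the empirical frequencies as a summary string. The function names `empirical_distribution` and `build_summary` are hypothetical, and this is neither the authors' procedure nor the SelfReflect metric itself.

```python
from collections import Counter


def empirical_distribution(samples):
    """Return (answer, relative frequency) pairs, most frequent first.

    Assumption: answers are grouped by exact match after stripping
    whitespace and lowercasing, which is a simplification.
    """
    normalized = [s.strip().lower() for s in samples]
    counts = Counter(normalized)
    total = len(normalized)
    return [(answer, count / total) for answer, count in counts.most_common()]


def build_summary(samples):
    """Format the empirical answer distribution as one summary string."""
    dist = empirical_distribution(samples)
    parts = [f"{answer} (about {round(100 * p)}%)" for answer, p in dist]
    return "Possible answers: " + "; ".join(parts)


# Hypothetical answers, standing in for repeated samples from one model.
samples = [
    "Paris", "paris", "Lyon", "Paris ", "Marseille",
    "Lyon", "Paris", "Paris", "Lyon", "Paris",
]
summary = build_summary(samples)
print(summary)
```

In a real pipeline the sampled answers, or a summary like this one, would be fed back to the model as context so that it can write its own description of the options and their likelihoods.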

Abstract

The common approach to communicate a large language model's (LLM) uncertainty is to add a percentage number or a hedging word to its response. But is this all we can do? Instead of generating a single answer and then hedging it, an LLM that is fully transparent to the user needs to be able to reflect on its internal belief distribution and output a summary of all options it deems possible, and how likely they are. To test whether LLMs possess this capability, we develop the SelfReflect metric, an information-theoretic distance between a given summary and a distribution over answers. In interventional and human studies, we find that SelfReflect indicates even slight deviations, yielding a fine measure of faithfulness between a summary string and an LLM's actual internal distribution over answers. With SelfReflect, we make a resounding negative observation: modern LLMs are, across the…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- Novel contributions to uncertainty quantification in LLMs, offering a fresh perspective and designing a new set of metrics and evaluations.
- Proposal of a new metric that is grounded in information-theoretic concepts, such as mutual information.
- Experimental design allows for the controlled assessment of the capabilities of the proposed metric, comparing it against numerous baselines, including LLM-as-a-judge approaches.
- Benchmark 10+ models in 3+ datasets (Trivia

Weaknesses

W1. **Controlled experiments in section 4 concern different models**, raising questions about the generalization of the results (see Questions).
W2. Some sections are a bit confusing or not clearly explained (see Questions for details).

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper is clearly written, and the experiments are reasonably documented in the appendix.
2. The theoretical justification of the SelfReflect metric appears to be sound; the proofs with sufficient statistics provided in Appendix A are convincing.
3. Visible good effort to connect the theoretical propositions to experimental evaluation. The authors provide ablations of the SelfReflect metric and consider a broad range of LM decoding paradigms, including reasoning.
4. Human feedback studies improve

Weaknesses

1. Minor (little to no impact on my score):
   1. Anthropomorphising the language models: while subjective and stylistic, I view it as a minor negative aspect, e.g., line 077: "its internal beliefs", line 82: "making LLMs aware of their internal uncertainties", line 478: "make LLMs honestly describe", etc.
   2. Line 15: "all options it deems possible" - for a language model all options (i.e. every combinatoric token sequence) are technically "possible" unless -inf is allowed in the logits someh

Reviewer 03 · Rating 4 · Confidence 2

Strengths

1. The experimental results in Section 4, including Distinguishing Good, Bad, and Almost-Good Summaries, Multiple-choice QA, and Alignment with Human Judgments, are solid and sufficient.
2. The explanation of why CoT fails to reflect the LLM’s internal confidence is very interesting.

Weaknesses

1. To be honest, I am not quite familiar with this area but would like to ask a very general question. Why do you think the topic, whether an LLM can accurately express its own confidence through natural language, is important?
2. Based on 1, do we even have a promising metric to reflect an LLM’s internal probability distribution yet? If not, the topic of investigating whether an LLM can accurately express its internal probability distribution may not be reliable.
3. For Eq2, I am quite confused wh

Taxonomy

Topics: Artificial Intelligence in Law · Academic integrity and plagiarism · Wikis in Education and Collaboration