Response Uncertainty and Probe Modeling: Two Sides of the Same Coin in LLM Interpretability?

Yongjie Wang; Yibo Wang; Xin Zhou; Zhiqi Shen

arXiv:2505.18575·cs.AI·May 27, 2025

TL;DR

This paper examines the relationship between LLM response uncertainty and probe performance. It finds that lower response uncertainty corresponds to better probe performance, and that higher response variance is associated with a larger set of important features.

Contribution

It introduces a quantitative analysis linking response uncertainty with probe effectiveness and feature importance, advancing understanding of LLM interpretability.

Findings

01. Lower response uncertainty corresponds to better probe performance, and vice versa

02. Higher response variance is associated with a larger set of important features

03. Response uncertainty helps identify interpretable LLM representations

Abstract

Probing techniques have shown promise in revealing how LLMs encode human-interpretable concepts, particularly when applied to curated datasets. However, the factors governing a dataset's suitability for effective probe training are not well-understood. This study hypothesizes that probe performance on such datasets reflects characteristics of both the LLM's generated responses and its internal feature space. Through quantitative analysis of probe performance and LLM response uncertainty across a series of tasks, we find a strong correlation: improved probe performance consistently corresponds to a reduction in response uncertainty, and vice versa. Subsequently, we delve deeper into this correlation through the lens of feature importance analysis. Our findings indicate that high LLM response variance is associated with a larger set of important features, which poses a greater challenge…
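The correlation analysis described in the abstract can be illustrated with a minimal sketch. It assumes response uncertainty is measured as the Shannon entropy of repeatedly sampled answers, which is one common proxy; the paper's exact measure may differ. The task names, sampled responses, and probe accuracies below are hypothetical placeholders, not results from the paper, and `response_entropy` and `pearson` are illustrative helper names.

```python
import math
from collections import Counter


def response_entropy(responses):
    """Shannon entropy (in bits) of the empirical distribution of sampled responses."""
    counts = Counter(responses)
    total = len(responses)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical toy data: eight sampled answers per task (NOT from the paper).
sampled = {
    "task_a": ["yes"] * 8,                # model always gives the same answer
    "task_b": ["yes"] * 6 + ["no"] * 2,   # some disagreement across samples
    "task_c": ["yes"] * 4 + ["no"] * 4,   # maximal disagreement for two answers
}
# Placeholder probe accuracies, chosen only to illustrate the computation.
probe_accuracy = {"task_a": 0.95, "task_b": 0.85, "task_c": 0.70}

tasks = sorted(sampled)
uncertainty = [response_entropy(sampled[t]) for t in tasks]
accuracy = [probe_accuracy[t] for t in tasks]

# A negative r means higher uncertainty goes with lower probe accuracy.
r = pearson(uncertainty, accuracy)
print(f"Pearson r between response entropy and probe accuracy: {r:.3f}")
```

In a real analysis, the accuracies would come from probes trained on the model's hidden representations for each dataset, and a rank correlation could replace Pearson's r if the relationship is not expected to be linear.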

Taxonomy

Topics: Natural Language Processing Techniques

Methods: ALIGN · Sparse Evolutionary Training