Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

Het Patel; Tiejin Chen; Hua Wei; Evangelos E. Papalexakis; Jia Chen

arXiv:2604.19974 · cs.LG · April 23, 2026

TL;DR

Using sparse autoencoders on Llama-3.1-8B and Gemma-2-9B, this paper tests whether an LLM's uncertainty and its correctness are driven by the same internal features. It finds that the two are functionally and informationally distinct, which has implications for interpretability and for targeted interventions on model behavior.

Contribution

The study introduces a 2x2 framework that partitions model predictions along correctness and confidence axes and uses sparse autoencoders to identify the features associated with each axis independently, revealing their distinct functional roles and transferability.

Findings

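01. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy.

02. Pure incorrectness features are functionally inert: although their activations differ significantly between correct and incorrect predictions, the majority produce near-zero change in accuracy when suppressed.

03. Targeted suppression of confounded features improves accuracy and reduces entropy.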
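
To make "suppressing a feature" concrete, here is a minimal sketch of one common way to ablate sparse-autoencoder features in an activation vector. It is an illustration under stated assumptions, not the paper's confirmed procedure: the ReLU encoder, the weight names (`W_enc`, `b_enc`, `W_dec`, `b_dec`), and the choice to add back the SAE's reconstruction error are all hypothetical.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Feature activations of an assumed ReLU sparse autoencoder."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f, W_dec, b_dec):
    """Map feature activations back to the model's activation space."""
    return f @ W_dec + b_dec

def suppress_features(x, W_enc, b_enc, W_dec, b_dec, feature_ids):
    """Return activations x with the selected SAE features zeroed out.

    The reconstruction error is added back so that only the contribution
    of the suppressed features changes (an assumption of this sketch).
    """
    ids = np.asarray(feature_ids, dtype=int)
    f = sae_encode(x, W_enc, b_enc)
    error = x - sae_decode(f, W_dec, b_dec)  # part of x the SAE does not capture
    f_ablated = f.copy()
    f_ablated[..., ids] = 0.0                # zero the chosen features
    return sae_decode(f_ablated, W_dec, b_dec) + error
```

In an experiment of this kind, the modified activations would be written back into the model's forward pass, and accuracy and output entropy would be compared with and without the suppression.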

Abstract

Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2x2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify three feature populations that play fundamentally different functional roles. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: despite showing statistically significant activation differences between correct and incorrect predictions, the majority produce near-zero change in accuracy when…
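
The 2x2 partition described in the abstract can be sketched as follows. This is a hedged illustration only: using the entropy of the output distribution as the confidence measure, the fixed `threshold`, and the cell names are assumptions of this sketch, not details confirmed by the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def partition_2x2(prob_rows, labels, threshold):
    """Group prediction indices by (confidence, correctness).

    A prediction counts as "confident" when its output entropy is at or
    below `threshold` (a hypothetical cutoff chosen by the user).
    """
    cells = {
        "confident_correct": [],
        "confident_wrong": [],
        "uncertain_correct": [],
        "uncertain_wrong": [],
    }
    for i, (probs, label) in enumerate(zip(prob_rows, labels)):
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        confidence = "confident" if entropy(probs) <= threshold else "uncertain"
        outcome = "correct" if predicted == label else "wrong"
        cells[f"{confidence}_{outcome}"].append(i)
    return cells

# Toy example: two peaked and two flat distributions over four answers.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.4, 0.3, 0.2, 0.1]
cells = partition_2x2([peaked, peaked, flat, flat], [0, 1, 0, 1], threshold=0.5)
print(cells)
```

With predictions grouped this way, feature activations can be contrasted along one axis while the other is held fixed, which is the sense in which the framework treats each dimension independently.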
