Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
Het Patel, Tiejin Chen, Hua Wei, Evangelos E. Papalexakis, Jia Chen

TL;DR
This paper investigates whether LLM uncertainty and correctness are driven by the same internal features, finding they are functionally and informationally distinct, with implications for model interpretability and intervention.
Contribution
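The study introduces a 2x2 framework that partitions model predictions by correctness and confidence, then uses sparse autoencoders to separate the features associated with uncertainty from those associated with correctness in LLMs, revealing their distinct functional roles and transferability.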
Findings
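Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy.
Pure incorrectness features are functionally inert: although their activations differ significantly between correct and incorrect predictions, most produce near-zero change in accuracy when suppressed.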
Targeted suppression of confounded features improves accuracy and reduces entropy.
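To make "suppression" concrete, here is a minimal sketch of one common way to ablate sparse-autoencoder features from a single activation vector: encode the activation, then subtract each targeted feature's decoder direction scaled by its activation. This summary does not specify the paper's exact intervention, so the procedure, the function names (`encode`, `suppress_features`), and the toy weights are all assumptions for illustration.

```python
def encode(x, W_enc, b_enc):
    """ReLU encoder: one non-negative activation per SAE feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]


def suppress_features(x, W_enc, b_enc, W_dec, feature_ids):
    """Remove the contribution of selected features from activation x.

    Assumption: suppression means subtracting f_j * d_j, where f_j is the
    feature activation and d_j is row j of the decoder matrix W_dec.
    """
    f = encode(x, W_enc, b_enc)
    out = list(x)
    for j in feature_ids:
        for k in range(len(out)):
            out[k] -= f[j] * W_dec[j][k]
    return out


# Toy 2-dimensional example with identity encoder and decoder (hypothetical).
x = [1.0, 2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0]]

ablated = suppress_features(x, W_enc, b_enc, W_dec, feature_ids=[1])
```

In a real experiment the modified activation would be written back into the model's residual stream, and accuracy and output entropy would be compared before and after the intervention.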
Abstract
Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2x2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify three feature populations that play fundamentally different functional roles. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: despite showing statistically significant activation differences between correct and incorrect predictions, the majority produce near-zero change in accuracy when…
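The 2x2 partition described in the abstract can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes confidence is measured by the entropy of the model's distribution over answer options and that a fixed entropy cutoff separates confident from uncertain predictions. The function names, the cutoff value, and the example probabilities are hypothetical.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy in nats of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def partition_2x2(examples, threshold):
    """Group example indices into cells keyed by (correctness, confidence).

    examples: list of (probs, predicted_index, gold_index) tuples.
    threshold: entropy cutoff (an assumed, hypothetical choice); predictions
    with entropy below it count as confident.
    """
    cells = defaultdict(list)
    for i, (probs, pred, gold) in enumerate(examples):
        correctness = "correct" if pred == gold else "incorrect"
        confidence = "confident" if entropy(probs) < threshold else "uncertain"
        cells[(correctness, confidence)].append(i)
    return dict(cells)


# Four made-up predictions, one per cell.
examples = [
    ([0.97, 0.01, 0.01, 0.01], 0, 0),  # confident and correct
    ([0.40, 0.30, 0.20, 0.10], 0, 0),  # uncertain yet correct
    ([0.95, 0.03, 0.01, 0.01], 0, 2),  # confident yet wrong
    ([0.30, 0.30, 0.20, 0.20], 1, 3),  # uncertain and wrong
]
cells = partition_2x2(examples, threshold=0.5)
```

Comparing feature activations across cells that share one axis value but differ on the other is what would let uncertainty-related features be identified independently of correctness-related ones.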
