Interpreting and Mitigating Unwanted Uncertainty in LLMs
Tiasa Singha Roy, Ayush Rajesh Jhaveri, Ilias Triantafyllopoulos

TL;DR
This paper investigates the causes of unwanted answer-flipping in Large Language Models and proposes a masking technique targeting specific attention heads to reduce this uncertainty-driven failure mode.
Contribution
The paper identifies a small set of non-retrieval attention heads that attend disproportionately to misleading tokens, and shows that masking them reduces answer-flipping by up to 15%.
Findings
Masking a small set of non-retrieval attention heads reduces flip behavior by up to 15%.
Retrieval heads are not the main cause of uncertainty.
Trade-offs exist between flip mitigation and downstream performance.
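To make "flip behavior" concrete, here is a minimal sketch of one way a flip rate could be computed: the fraction of initially correct answers that become incorrect after a re-evaluation prompt. The function name `flip_rate` and the boolean-list input format are assumptions for illustration; the paper's exact metric may differ.

```python
from typing import Sequence


def flip_rate(first_correct: Sequence[bool], second_correct: Sequence[bool]) -> float:
    """Fraction of initially correct answers that turn incorrect on re-prompting.

    first_correct[i]  -- was the model's first answer to item i correct?
    second_correct[i] -- was its answer after the re-evaluation prompt correct?
    (Hypothetical input format, chosen for illustration only.)
    """
    if len(first_correct) != len(second_correct):
        raise ValueError("both sequences must have the same length")

    # Only items answered correctly the first time can exhibit an unwanted flip.
    after_for_correct = [
        after for before, after in zip(first_correct, second_correct) if before
    ]
    if not after_for_correct:
        return 0.0

    flips = sum(1 for after in after_for_correct if not after)
    return flips / len(after_for_correct)
```

Under this definition, a reported reduction in flip behavior would correspond to a lower value of this ratio after masking than before, measured on the same set of prompts.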
Abstract
Despite their impressive capabilities, Large Language Models (LLMs) exhibit unwanted uncertainty, a phenomenon where a model changes a previously correct answer into an incorrect one when re-prompted. This behavior undermines trust and poses serious risks in high-stakes domains. In this work, we investigate the mechanisms that drive this phenomenon. We adapt the Needle-in-a-Haystack retrieval framework and integrate a Flip-style re-evaluation prompt to simulate realistic answer-flipping scenarios. We find that retrieval heads are not primarily responsible for avoiding uncertainty. Instead, we identify a small set of non-retrieval attention heads that disproportionately attend to misleading tokens in uncertain contexts. Masking these heads yields significant improvements, reducing flip behavior by up to 15% without introducing incoherence or overcorrection. However, when tested for…
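The head-selection and masking idea described in the abstract can be illustrated with a small, framework-free sketch. Everything here is an assumption made for illustration: the names `misleading_attention_mass`, `select_heads`, and `mask_head_outputs`, the `(layer, head)` keying, the top-k selection rule, and the choice of zeroing a head's output as the masking operation. The abstract does not specify how the authors score or mask heads.

```python
from typing import Dict, List, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index); hypothetical keying scheme


def misleading_attention_mass(attn_row: List[float], misleading_positions: Set[int]) -> float:
    """Share of one head's attention weights that falls on misleading token positions."""
    total = sum(attn_row)
    if total == 0:
        return 0.0
    on_misleading = sum(w for i, w in enumerate(attn_row) if i in misleading_positions)
    return on_misleading / total


def select_heads(
    attn: Dict[Head, List[float]], misleading_positions: Set[int], top_k: int
) -> List[Head]:
    """Rank heads by attention mass on misleading tokens and keep the top_k (assumed rule)."""
    scores = {
        head: misleading_attention_mass(row, misleading_positions)
        for head, row in attn.items()
    }
    return sorted(scores, key=lambda head: scores[head], reverse=True)[:top_k]


def mask_head_outputs(
    head_outputs: Dict[Head, List[float]], masked: Set[Head]
) -> Dict[Head, List[float]]:
    """Zero the output vector of every masked head; copy the others unchanged."""
    return {
        head: [0.0] * len(vec) if head in masked else list(vec)
        for head, vec in head_outputs.items()
    }
```

In a real model the same effect would be applied inside the attention layers before the per-head outputs are combined; this sketch only shows the bookkeeping on plain Python lists.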
