Causal Evidence that Language Models use Confidence to Drive Behavior
Dharshan Kumaran, Nathaniel Daw, Simon Osindero, Petar Veličković, Viorica Patraucean

TL;DR
This paper provides causal evidence that language models use internal confidence signals, including verbal cues, to control behavior such as deciding when to abstain. The authors interpret this as structured metacognitive control.
Contribution
It introduces a four-phase paradigm to causally demonstrate how LLMs use confidence signals, including verbal cues, for abstention decisions, advancing understanding of model metacognition.
Findings
Models apply an implicit confidence threshold for abstention.
Boosting confidence signals decreases abstention rates.
Verbal confidence predicts abstention independently of output distribution.
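The first finding can be pictured as a simple decision rule. The sketch below is illustrative only and is not the authors' procedure: the function names, the use of maximum softmax probability as the confidence signal, and the threshold value of 0.5 are all assumptions made for the example.

```python
import math


def max_softmax_confidence(logits):
    """Hypothetical confidence signal: probability of the top answer option."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)


def should_abstain(confidence, threshold=0.5):
    """Abstain when confidence falls below the (assumed) threshold."""
    return confidence < threshold


# Two equally likely options give confidence 0.5, which is not below 0.5.
flat = max_softmax_confidence([0.0, 0.0])
# One dominant option gives confidence close to 1.
peaked = max_softmax_confidence([5.0, 0.0, 0.0])

print(should_abstain(flat), should_abstain(peaked))
print(should_abstain(flat, threshold=0.6))
```

Under this rule, raising the confidence value for a fixed threshold can only move a decision from abstaining to answering, which is the direction of the second finding.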
Abstract
Metacognition, the assessment of the quality of one's own cognitive performance, guides adaptive behavior across species. Substantial research demonstrates that confidence signals can be extracted from language model outputs, yet a fundamental question remains: do models actually use these signals to control behavior, such as deciding whether to answer or abstain? To investigate, we developed a four-phase paradigm. Phase 1 elicited baseline confidence estimates without an abstention option. Phase 2 revealed that LLMs apply an implicit threshold to internal confidence when deciding to abstain, with confidence effect sizes approximately an order of magnitude larger than alternative mechanisms. Phase 3 provided direct causal evidence through activation steering: boosting or suppressing confidence signals correspondingly decreased or increased abstention rates. Phase 4 extended this by…
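As a rough illustration of the activation-steering idea in Phase 3, the toy sketch below adds a scaled direction vector to a hidden state and checks how an abstention decision changes. Everything here is assumed for the example: the two-dimensional hidden state, the linear confidence readout, the steering direction, and the threshold are hypothetical and do not reproduce the paper's models or method.

```python
import math


def steer(hidden, direction, alpha):
    """Add alpha * direction to the hidden state (toy activation steering)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def confidence_readout(hidden, weights):
    """Hypothetical linear readout squashed to (0, 1) with a sigmoid."""
    z = sum(h * w for h, w in zip(hidden, weights))
    return 1.0 / (1.0 + math.exp(-z))


def abstains(hidden, weights, threshold=0.6):
    """Abstain when the read-out confidence is below the assumed threshold."""
    return confidence_readout(hidden, weights) < threshold


weights = [1.0, 0.0]      # assumed readout weights
direction = [1.0, 0.0]    # assumed "confidence direction" in activation space
hidden = [0.0, 0.0]       # baseline state, read-out confidence 0.5

boosted = steer(hidden, direction, alpha=2.0)      # push confidence up
suppressed = steer(hidden, direction, alpha=-2.0)  # push confidence down

print(abstains(hidden, weights))      # baseline decision
print(abstains(boosted, weights))     # decision after boosting
print(abstains(suppressed, weights))  # decision after suppressing
```

In this toy setup, boosting along the assumed direction flips the baseline abstention into an answer, while suppressing keeps the model abstaining, mirroring the qualitative pattern the abstract reports.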
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Language and cultural evolution · Embodied and Extended Cognition
