TL;DR
This paper shows that large language models have decomposable internal metacognitive states that influence their reasoning, and that steering these states modulates model behavior.
Contribution
It introduces a framework to identify and causally manipulate internal metacognitive states in LLMs, revealing their impact on reasoning and evaluation.
Findings
Metacognitive states are linearly decodable from internal activations.
Steering activations along probe directions modulates reasoning behavior.
Benchmark performance depends on which internal states are activated.
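The first two findings can be illustrated with a minimal sketch. It is not the paper's code: the activations are synthetic Gaussian vectors standing in for residual-stream activations, the probe is a plain ridge regression, and the names `fit_linear_probe`, `probe_score`, and `steer` are hypothetical. It shows only what "linearly decodable" and "steering along a probe direction" mean in practice.

```python
import numpy as np

def fit_linear_probe(acts, labels, l2=1e-3):
    """Fit a ridge-regression probe; return (unit direction, rescaled bias)."""
    # Append a constant column so the probe learns a bias term.
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    y = 2.0 * labels - 1.0            # map {0, 1} labels to {-1, +1}
    reg = l2 * np.eye(X.shape[1])
    reg[-1, -1] = 0.0                 # leave the bias unregularized
    w = np.linalg.solve(X.T @ X + reg, X.T @ y)
    norm = np.linalg.norm(w[:-1])
    # Dividing weights and bias by the same norm keeps the decision boundary.
    return w[:-1] / norm, w[-1] / norm

def probe_score(acts, direction, bias):
    """Signed distance of each activation from the probe's decision boundary."""
    return acts @ direction + bias

def steer(acts, direction, alpha):
    """Shift every activation by alpha units along the probe direction."""
    return acts + alpha * direction

# Synthetic stand-in for activations (assumption: a binary state is encoded
# along one hidden direction, plus isotropic noise).
rng = np.random.default_rng(0)
n, d = 400, 64
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * labels - 1, true_dir)

direction, bias = fit_linear_probe(acts, labels)
scores = probe_score(acts, direction, bias)
accuracy = np.mean((scores > 0) == (labels == 1))

# Steering adds alpha to every probe score because the direction has unit norm.
steered = steer(acts, direction, alpha=2.0)
shift = probe_score(steered, direction, bias) - scores

print("probe accuracy:", accuracy)
print("mean score shift after steering:", shift.mean())
```

In a real model the shift would be applied to a chosen layer's activations during the forward pass, and the effect on generated reasoning would have to be measured downstream. The sketch only confirms the geometric part: a unit-norm probe direction moves the probe score by exactly the steering coefficient.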
Abstract
Large language models (LLMs) increasingly exhibit behaviors suggesting awareness of their evaluation context, often adapting their reasoning strategies in benchmark settings. Prior work has shown that such evaluation awareness can distort performance measurements; however, it remains unclear whether this phenomenon reflects a single behavioral artifact or a deeper internal structure within the model. We propose that LLMs maintain a decomposable space of functional metacognitive states: internal variables encoding factors such as evaluation awareness, self-assessed capability, perceived risk, computational effort allocation, audience expertise adaptation, and intentionality. Through residual stream analysis across multiple reasoning models, we demonstrate that these states are linearly decodable from internal activations and exhibit distinct layer-wise profiles. Moreover, by steering…
