Unveiling the Latent Directions of Reflection in Large Language Models
Fu-Chieh Chang, Yu-Ting Lee, Pei-Yuan Wu

TL;DR
This paper explores the internal mechanisms of reflection in large language models by analyzing activation directions, demonstrating how reflective behavior can be systematically identified and controlled through activation steering techniques.
Contribution
It introduces a novel activation steering methodology to characterize and manipulate reflection levels in LLMs, advancing understanding of their internal reflective processes.
Findings
Reflection can be systematically induced or suppressed via activation interventions.
Steering vectors effectively differentiate reflection levels in model activations.
Suppressing reflection is easier than stimulating it in LLMs.
Abstract
Reflection, the ability of large language models (LLMs) to evaluate and revise their own reasoning, has been widely used to improve performance on complex reasoning tasks. Yet most prior work emphasizes designing reflective prompting strategies or reinforcement learning objectives, leaving the inner mechanisms of reflection underexplored. In this paper, we investigate reflection through the lens of latent directions in model activations. We propose a methodology based on activation steering to characterize instructions with different reflective intentions: no reflection, intrinsic reflection, and triggered reflection. By constructing steering vectors between these reflection levels, we demonstrate that (1) new reflection-inducing instructions can be systematically identified, (2) reflective behavior can be directly enhanced or suppressed through activation interventions, and (3)…
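To make the idea of a steering vector between two reflection levels concrete, here is a minimal sketch. It assumes the common difference-of-means formulation of activation steering, which may differ from the authors' exact construction. The activations are synthetic NumPy arrays standing in for real hidden states, and the names `steering_vector`, `steer`, `projection`, and `alpha` are hypothetical, chosen only for illustration.

```python
import numpy as np


def steering_vector(acts_source, acts_target):
    """Difference of mean activations, pointing from the source level to the target level."""
    return acts_target.mean(axis=0) - acts_source.mean(axis=0)


def steer(hidden, vector, alpha):
    """Add alpha * vector to each hidden state; a negative alpha pushes the other way."""
    return hidden + alpha * vector


def projection(hidden, vector):
    """Scalar projection of each hidden state onto the unit steering direction."""
    unit = vector / np.linalg.norm(vector)
    return hidden @ unit


# Synthetic stand-ins for activations (assumption: rows are prompts, columns are hidden units).
rng = np.random.default_rng(0)
dim = 16
direction = np.zeros(dim)
direction[0] = 1.0  # pretend "reflection" lives along the first coordinate

no_reflection = rng.normal(size=(200, dim))
triggered = rng.normal(size=(200, dim)) + 3.0 * direction

v = steering_vector(no_reflection, triggered)

# Enhance reflection on "no reflection" activations, suppress it on "triggered" ones.
boosted = steer(no_reflection, v, alpha=1.0)
suppressed = steer(triggered, v, alpha=-1.0)
```

In a real model the same addition would be applied to a chosen layer's hidden states during the forward pass, and the projection onto the vector could serve as a score for ranking candidate instructions by how reflective they appear. Both uses are sketched here under the assumptions stated above.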
