Understanding Reasoning in Thinking Language Models via Steering Vectors
Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda

TL;DR
This paper introduces a method for steering reasoning behaviors in thinking language models by identifying and manipulating linear directions in their activation space, which improves both interpretability and control.
Contribution
The work presents a novel approach for steering reasoning behaviors in thinking LLMs with steering vectors, studied on DeepSeek-R1-Distill models using 500 tasks across 10 diverse categories.
Findings
Linear directions in activation space mediate reasoning behaviors.
Steering vectors can control behaviors such as expressing uncertainty and backtracking.
The method generalizes across different model architectures.
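The idea behind these findings can be illustrated with a toy sketch. It assumes a difference-of-means construction, which is one common way to build a steering vector and is not confirmed by this summary as the authors' exact procedure. The activations are synthetic NumPy arrays rather than real model states, and the names `steering_vector`, `steer`, and `projection` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def steering_vector(acts_with, acts_without):
    """Difference of mean activations; inputs are (n, d) arrays, output is (d,)."""
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)


def steer(hidden, vector, alpha):
    """Add alpha * vector to each hidden state; a negative alpha suppresses the behavior."""
    return hidden + alpha * vector


def projection(hidden, vector):
    """Scalar projection of each hidden state onto the unit steering direction."""
    unit = vector / np.linalg.norm(vector)
    return hidden @ unit


# Synthetic stand-in for model activations (assumption: the behavior shifts one direction).
rng = np.random.default_rng(0)
d = 16
direction = np.zeros(d)
direction[3] = 1.0
acts_without = rng.normal(size=(200, d))                  # behavior absent
acts_with = rng.normal(size=(200, d)) + 2.0 * direction   # behavior present

v = steering_vector(acts_with, acts_without)

hidden = rng.normal(size=(5, d))
boosted = steer(hidden, v, alpha=1.0)    # push toward the behavior
damped = steer(hidden, v, alpha=-1.0)    # push away from the behavior
```

In a real model the two activation sets would come from a chosen layer on text that does or does not exhibit the behavior, and the vector would be added to that layer's activations during generation.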
Abstract
Recent advances in large language models (LLMs) have led to the development of thinking language models that generate extensive internal reasoning chains before producing responses. While these models achieve improved performance, controlling their reasoning processes remains challenging. This work presents a steering approach for thinking LLMs by analyzing and manipulating specific reasoning behaviors in DeepSeek-R1-Distill models. Through a systematic experiment on 500 tasks across 10 diverse categories, we identify several reasoning behaviors exhibited by thinking models, including expressing uncertainty, generating examples for hypothesis validation, and backtracking in reasoning chains. We demonstrate that these behaviors are mediated by linear directions in the model's activation space and can be controlled using steering vectors. By extracting and applying these vectors, we…
