Understanding Reasoning in Thinking Language Models via Steering Vectors

Constantin Venhoff; Iván Arcuschin; Philip Torr; Arthur Conmy; Neel Nanda

arXiv:2506.18167·cs.LG·October 23, 2025

TL;DR

This paper introduces a method for steering reasoning behaviors in thinking language models by identifying and manipulating linear directions in their activation space, which improves both interpretability and control.

Contribution

The work presents a novel approach to steering reasoning behaviors in thinking LLMs using steering vectors (linear directions in activation space), validated across multiple models and diverse tasks.

Findings

01

Linear directions in activation space mediate reasoning behaviors.

02

Steering vectors effectively control behaviors such as expressing uncertainty and backtracking.

03

The method generalizes across different model architectures.

Abstract

Recent advances in large language models (LLMs) have led to the development of thinking language models that generate extensive internal reasoning chains before producing responses. While these models achieve improved performance, controlling their reasoning processes remains challenging. This work presents a steering approach for thinking LLMs by analyzing and manipulating specific reasoning behaviors in DeepSeek-R1-Distill models. Through a systematic experiment on 500 tasks across 10 diverse categories, we identify several reasoning behaviors exhibited by thinking models, including expressing uncertainty, generating examples for hypothesis validation, and backtracking in reasoning chains. We demonstrate that these behaviors are mediated by linear directions in the model's activation space and can be controlled using steering vectors. By extracting and applying these vectors, we…
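The idea of a behavior being "mediated by a linear direction" can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: it assumes a difference-of-means construction (a common way to build steering vectors, not confirmed here as the authors' exact procedure), and it works on plain NumPy arrays standing in for hidden activations. The function names `difference_of_means_vector`, `apply_steering`, and `projection`, as well as the scaling coefficient `alpha`, are hypothetical.

```python
import numpy as np


def difference_of_means_vector(acts_with, acts_without):
    """Candidate steering vector: mean activation on examples that show the
    behavior minus mean activation on examples that do not.

    Both inputs have shape (n_examples, hidden_dim).
    ASSUMPTION: difference-of-means is used here for illustration only.
    """
    acts_with = np.asarray(acts_with, dtype=float)
    acts_without = np.asarray(acts_without, dtype=float)
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)


def apply_steering(hidden, vector, alpha):
    """Shift every hidden state by alpha * vector.

    alpha > 0 pushes activations along the direction (more of the behavior,
    if the direction really mediates it); alpha < 0 pushes against it.
    """
    hidden = np.asarray(hidden, dtype=float)
    return hidden + alpha * np.asarray(vector, dtype=float)


def projection(hidden, vector):
    """Scalar projection of each hidden state onto the unit steering direction."""
    vector = np.asarray(vector, dtype=float)
    unit = vector / np.linalg.norm(vector)
    return np.asarray(hidden, dtype=float) @ unit


if __name__ == "__main__":
    # Synthetic stand-in data: "behavior" examples are offset along a fixed direction.
    rng = np.random.default_rng(0)
    hidden_dim = 16
    true_direction = np.zeros(hidden_dim)
    true_direction[0] = 1.0
    acts_without = rng.normal(size=(200, hidden_dim))
    acts_with = rng.normal(size=(200, hidden_dim)) + 3.0 * true_direction

    v = difference_of_means_vector(acts_with, acts_without)
    h = rng.normal(size=(4, hidden_dim))
    print("projection before:", projection(h, v))
    print("projection after: ", projection(apply_steering(h, v, alpha=1.5), v))
```

In a real model the arrays would be residual-stream activations captured at a chosen layer, and the shift would be applied during generation. Which layer, which token positions, and how large `alpha` should be are choices this sketch does not settle.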
