Controllable LLM Reasoning via Sparse Autoencoder-Based Steering
Yi Fang, Wenjie Wang, Mingfeng Xue, Boyi Deng, Fengli Xu, Dayiheng Liu, Fuli Feng

TL;DR
This paper introduces SAE-Steering, a novel method using sparse autoencoders to control and improve reasoning strategies in large reasoning models, leading to more reliable and accurate reasoning paths.
Contribution
It proposes a two-stage feature identification pipeline with sparse autoencoders to effectively control reasoning strategies in LRMs, surpassing existing methods in control effectiveness.
Findings
SAE-Steering filters out over 99% of irrelevant features.
It achieves over 15% improvement in control effectiveness.
It improves reasoning accuracy by 7% by redirecting erroneous paths.
Abstract
Large Reasoning Models (LRMs) exhibit human-like cognitive reasoning strategies (e.g., backtracking, cross-verification) during the reasoning process, which improves their performance on complex tasks. Currently, reasoning strategies are autonomously selected by LRMs themselves. However, such autonomous selection often produces inefficient or even erroneous reasoning paths. To make reasoning more reliable and flexible, it is important to develop methods for controlling reasoning strategies. Existing methods struggle to control fine-grained reasoning strategies due to conceptual entanglement in LRMs' hidden states. To address this, we leverage Sparse Autoencoders (SAEs) to decompose strategy-entangled hidden states into a disentangled feature space. To identify the few strategy-specific features from the vast pool of SAE features, we propose SAE-Steering, an efficient two-stage feature…
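As a rough illustration of the general idea the abstract describes (decomposing a hidden state with an SAE and nudging it along one feature's direction), the following is a minimal sketch in plain Python. It is not the paper's two-stage feature identification pipeline, which the truncated abstract does not specify. The class `ToySAE`, the function `steer`, the random weights, and the parameters `feature_idx` and `alpha` are all hypothetical names and values chosen for illustration only.

```python
import random


def matvec(rows, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(r * v for r, v in zip(row, vec)) for row in rows]


class ToySAE:
    """Toy sparse autoencoder with random weights (illustrative assumption,
    not a trained SAE and not the paper's model)."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # One encoder row and one decoder direction per feature.
        self.enc = [[rng.gauss(0, 1) for _ in range(d_model)]
                    for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.dec = [[rng.gauss(0, 1) for _ in range(d_model)]
                    for _ in range(n_features)]

    def encode(self, h):
        """Map a hidden state to non-negative feature activations (ReLU)."""
        pre = matvec(self.enc, h)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        """Reconstruct a hidden state as a weighted sum of decoder directions."""
        d_model = len(self.dec[0])
        return [sum(f[i] * self.dec[i][j] for i in range(len(f)))
                for j in range(d_model)]


def steer(h, sae, feature_idx, alpha):
    """Add alpha times the chosen feature's decoder direction to h.

    This is generic activation steering along an SAE feature; how the
    feature is selected is assumed to be decided elsewhere.
    """
    direction = sae.dec[feature_idx]
    return [h_j + alpha * d_j for h_j, d_j in zip(h, direction)]


# Hypothetical usage: a 4-dimensional hidden state and 8 SAE features.
sae = ToySAE(d_model=4, n_features=8)
h = [0.5, -1.0, 0.25, 2.0]
activations = sae.encode(h)
steered = steer(h, sae, feature_idx=3, alpha=2.0)
```

In practice the steered vector would replace the hidden state at a chosen layer during generation; the layer, the feature index, and the steering strength are all choices this sketch leaves open.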
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling
