TL;DR
This paper introduces stability filtering, a method for keeping only those behavioral boundaries that reliably reproduce a reasoning behavior, and reports better steering accuracy (0.784 on MATH-500) and better transfer of steering vectors across models of the same architecture.
Contribution
It develops a probabilistic model to identify stable behavioral boundaries and proposes a filtering technique that enhances reasoning steering effectiveness.
Findings
Achieves 0.784 accuracy on MATH-500 with stability filtering.
Improves transferability of steering vectors across models within the same architecture.
Shows that 93.3% of 541 keyword-detected boundaries are behaviorally unstable: they fail to reproduce the detected behavior under re-generation from the same prefix.
Abstract
Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models, but constructing effective vectors requires identifying genuine behavioral signals in the model's hidden states. For behaviors that can be toggled via prompts, this is straightforward. However, many reasoning behaviors, such as self-reflection, emerge spontaneously and resist prompt-level control. Current methods detect these behaviors through keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3% are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic…
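The stability check described in the abstract (re-generate from the same prefix and see whether the detected behavior reappears) can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword list, the `stability_score` and `filter_stable` helpers, the 0.5 threshold, and the toy generator are all hypothetical stand-ins chosen for illustration, and a real setup would replace `toy_generate` with sampled continuations from a language model.

```python
from typing import Callable, List, Sequence

# Hypothetical keyword list for detecting self-reflection; not taken from the paper.
REFLECTION_KEYWORDS = ("wait", "let me double-check", "hmm")


def shows_behavior(text: str, keywords: Sequence[str] = REFLECTION_KEYWORDS) -> bool:
    """Keyword matching: True if any keyword occurs in the continuation."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def stability_score(prefix: str, generate: Callable[[str], str], n_samples: int = 20) -> float:
    """Fraction of re-generations from the same prefix that reproduce the behavior."""
    hits = sum(shows_behavior(generate(prefix)) for _ in range(n_samples))
    return hits / n_samples


def filter_stable(
    prefixes: Sequence[str],
    generate: Callable[[str], str],
    threshold: float = 0.5,  # assumed cutoff, chosen only for this sketch
    n_samples: int = 20,
) -> List[str]:
    """Keep only the prefixes whose stability score meets the threshold."""
    return [p for p in prefixes if stability_score(p, generate, n_samples) >= threshold]


def toy_generate(prefix: str) -> str:
    """Deterministic stand-in for a sampled model continuation (assumption)."""
    if "tricky" in prefix:
        return "Wait, let me re-check that step."
    return "So the answer is 4."


if __name__ == "__main__":
    candidates = ["This tricky integral gives", "Adding 2 and 2 gives"]
    for prefix in candidates:
        print(prefix, stability_score(prefix, toy_generate, n_samples=5))
    print(filter_stable(candidates, toy_generate, n_samples=5))
```

With a real sampler, the score for each prefix would lie between 0 and 1, and the threshold would control how strict the filter is about which boundaries are used to build steering vectors.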
