Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions
Yuntai Bao, Qinfeng Li, Xinyan Yu, Xuhong Zhang, Ge Su, Wenqi Zhang, Liu Yan, Haiqin Weng, and Jianwei Yin

TL;DR
This paper introduces a novel training method for steering vectors in large language models, enabling effective prompt-only interventions without sacrificing generation quality or requiring manual factor tuning.
Contribution
It proposes jointly training steering factors and directions, and introduces Prompt-only SV (PrOSV), which intervenes only on prompt tokens to improve robustness and utility.
Findings
PrOSV outperforms traditional full-sequence SVs (FSSVs) on AxBench.
Joint training eliminates the need for post-hoc factor selection.
PrOSV achieves a better tradeoff between utility and robustness.
Abstract
Recently, steering vectors (SVs) have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs suffer from two limitations. First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time. Second, they operate as full-sequence SVs (FSSVs), which can sacrifice generation quality regardless of factor selection due to excessive intervention on the model generation process. To address the first limitation, we propose joint training of steering factors and directions, such that post-hoc factor selection is no longer required. Using neural network scaling theory, we find that moderately large initialization sizes and learning rates for steering…
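To make the distinction between full-sequence and prompt-only intervention concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes the common additive form of steering, in which a scalar steering factor times a direction vector is added to the hidden states. The function name `apply_steering`, the toy shapes, and the values of `factor` and `direction` are all hypothetical. In the paper both the factor and the direction are trained, whereas here they are fixed constants.

```python
import numpy as np

def apply_steering(hidden, direction, factor, prompt_len, prompt_only=True):
    """Add factor * direction to hidden states of shape (seq_len, d_model).

    Assumption: steering is an additive shift of the hidden states.
    prompt_only=True shifts only the first prompt_len positions;
    prompt_only=False shifts every position (full-sequence).
    """
    steered = hidden.copy()
    end = prompt_len if prompt_only else hidden.shape[0]
    steered[:end] += factor * direction  # broadcast over positions
    return steered

rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 4))  # toy example: 4 prompt tokens + 2 generated tokens
direction = np.ones(4) / 2.0      # hypothetical unit-norm steering direction
factor = 3.0                      # hypothetical steering factor (trained in the paper)

prosv = apply_steering(hidden, direction, factor, prompt_len=4)
fssv = apply_steering(hidden, direction, factor, prompt_len=4, prompt_only=False)
```

The sketch only shows which positions are modified directly. In a real transformer, shifting prompt-token states at one layer would still influence generated tokens indirectly through attention, which this toy example does not model.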
