Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions

Yuntai Bao, Qinfeng Li, Xinyan Yu, Xuhong Zhang, Ge Su, Wenqi Zhang, Liu Yan, Haiqin Weng, and Jianwei Yin

arXiv:2605.05983·cs.LG·May 8, 2026

TL;DR

This paper introduces a novel training method for steering vectors in large language models, enabling effective prompt-only interventions without sacrificing generation quality or requiring manual factor tuning.

Contribution

The paper proposes joint training of steering factors and directions, and introduces Prompt-only SV (PrOSV), which intervenes only on prompt tokens, improving robustness and utility.

Findings

01 PrOSV outperforms traditional full-sequence SVs (FSSVs) on AxBench.

02 Joint training eliminates the need for post-hoc factor selection.

03 PrOSV achieves a better trade-off between utility and robustness.

Abstract

Recently, steering vectors (SVs) have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs suffer from two limitations. First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time. Second, they operate as full-sequence SVs (FSSVs), which can sacrifice generation quality regardless of factor selection due to excessive intervention on the model generation process. To address the first limitation, we propose joint training of steering factors and directions, such that post-hoc factor selection is no longer required. Using neural network scaling theory, we find that moderately large initialization sizes and learning rates for steering…
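The difference between a full-sequence and a prompt-only intervention can be illustrated with a toy sketch. This is not the authors' implementation: the function `add_steering`, its parameters, and the tiny list-based "hidden states" are all hypothetical, chosen only to show which token positions receive the steering term `factor * direction`. Under joint training as described in the abstract, both `factor` and `direction` would be learned; here they are fixed constants.

```python
from typing import List, Optional

Vector = List[float]


def add_steering(
    hidden: List[Vector],
    direction: Vector,
    factor: float,
    prompt_len: Optional[int] = None,
) -> List[Vector]:
    """Return a copy of `hidden` with factor * direction added at selected positions.

    hidden:     per-token hidden-state vectors, prompt tokens first, then
                generated tokens (a toy stand-in for one layer's activations).
    prompt_len: if None, every position is steered (full-sequence style);
                otherwise only positions < prompt_len are steered (prompt-only style).
    """
    steered = []
    for pos, h in enumerate(hidden):
        if prompt_len is None or pos < prompt_len:
            steered.append([x + factor * d for x, d in zip(h, direction)])
        else:
            steered.append(list(h))  # left untouched
    return steered


# Toy example (assumed values): 3 prompt tokens followed by 2 generated tokens.
hidden = [[0.0, 0.0] for _ in range(5)]
direction = [1.0, -1.0]
factor = 2.0

full_sequence = add_steering(hidden, direction, factor)
prompt_only = add_steering(hidden, direction, factor, prompt_len=3)
```

In this sketch the prompt-only variant leaves the generated-token positions unmodified. In a real transformer, the steered prompt states would still be visible to later tokens through attention, which the toy example does not model.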

Code & Models

Models

colored-dye/axbench-steering-vector (model)

Datasets

colored-dye/concept500-contrastive (dataset, 164 downloads)
