One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs

Jacob Dunefsky; Arman Cohan

arXiv:2502.18862·cs.LG·August 14, 2025

TL;DR

This paper introduces a method for optimizing steering vectors in large language models using a single example, enabling effective control over safety and alignment behaviors without extensive datasets.

Contribution

The authors propose a one-shot gradient descent approach for optimizing steering vectors, demonstrating their effectiveness in mediating safety and misalignment behaviors across models.

Findings

01. Optimized steering vectors can induce harmful behaviors in models.

02. One-shot SVs transfer effectively across different inputs.

03. SV optimization reveals insights into model misalignment and recovery.

Abstract

Steering vectors (SVs) have emerged as a promising approach for interpreting and controlling LLMs, but current methods typically require large contrastive datasets that are often impractical to construct and may capture spurious correlations. We propose directly optimizing SVs through gradient descent on a single training example, and systematically investigate how these SVs generalize. We consider several SV optimization techniques and find that the resulting SVs effectively mediate safety-relevant behaviors in multiple models. Indeed, in experiments on an alignment-faking model, we are able to optimize one-shot SVs that induce harmful behavior on benign examples and whose negations suppress harmful behavior on malign examples. And in experiments on refusal suppression, we demonstrate that one-shot optimized SVs can transfer across inputs, yielding a Harmbench attack success rate of…
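
The core idea in the abstract, optimizing a steering vector by gradient descent on a single training example, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the "model" is a single linear readout `W` over a hidden state `h`, the steering vector `v` is simply added to `h`, the loss is the negative log-probability of one target token, and the gradient is derived by hand rather than by autodiff. The names `optimize_steering_vector`, `logits_for`, `W`, and `h` are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_for(W, h, v):
    # Toy "model" (assumption): a linear readout applied to the steered activation h + v.
    steered = [hi + vi for hi, vi in zip(h, v)]
    return [sum(w * s for w, s in zip(row, steered)) for row in W]

def optimize_steering_vector(W, h, target, steps=200, lr=0.5):
    # Gradient descent on v alone; W and h (the single training example) stay fixed.
    v = [0.0] * len(h)
    for _ in range(steps):
        probs = softmax(logits_for(W, h, v))
        # Gradient of -log p[target] with respect to the logits is probs - onehot(target).
        dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
        # Chain rule through the linear readout: grad_v = W^T @ dlogits.
        grad = [sum(W[k][j] * dlogits[k] for k in range(len(W))) for j in range(len(v))]
        v = [vj - lr * gj for vj, gj in zip(v, grad)]
    return v

# One training example: the unsteered toy model prefers token 0; we steer toward token 1.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
h = [2.0, 0.0]
target = 1

v = optimize_steering_vector(W, h, target)
before = softmax(logits_for(W, h, [0.0, 0.0]))
after = softmax(logits_for(W, h, v))
negated = softmax(logits_for(W, h, [-x for x in v]))

print("p(target) unsteered:", before[target])
print("p(target) with +v:  ", after[target])
print("p(target) with -v:  ", negated[target])
```

In this toy setting, adding `v` raises the probability of the target token and adding `-v` lowers it, which loosely mirrors the abstract's description of SVs that induce a behavior and negations that suppress it. Whether a vector trained on one example generalizes to other inputs is the empirical question the paper studies, and this sketch does not address it.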

