One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
Jacob Dunefsky, Arman Cohan

TL;DR
This paper introduces a method for optimizing steering vectors in large language models using a single example, enabling effective control over safety and alignment behaviors without extensive datasets.
Contribution
The authors propose optimizing steering vectors by gradient descent on a single training example (one-shot optimization) and show that the resulting vectors mediate safety-relevant behaviors, including misalignment, in multiple models.
Findings
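In an alignment-faking model, optimized steering vectors can induce harmful behavior on benign examples, and their negations suppress harmful behavior on malign examples.
In refusal-suppression experiments, one-shot SVs transfer across different inputs.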
SV optimization reveals insights into model misalignment and recovery.
Abstract
Steering vectors (SVs) have emerged as a promising approach for interpreting and controlling LLMs, but current methods typically require large contrastive datasets that are often impractical to construct and may capture spurious correlations. We propose directly optimizing SVs through gradient descent on a single training example, and systematically investigate how these SVs generalize. We consider several SV optimization techniques and find that the resulting SVs effectively mediate safety-relevant behaviors in multiple models. Indeed, in experiments on an alignment-faking model, we are able to optimize one-shot SVs that induce harmful behavior on benign examples and whose negations suppress harmful behavior on malign examples. And in experiments on refusal suppression, we demonstrate that one-shot optimized SVs can transfer across inputs, yielding a Harmbench attack success rate of…
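The core idea in the abstract, optimizing a steering vector by gradient descent on a single example, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the "model" is a single linear unembedding `W` applied to one hidden state `h`, the objective is the negative log-probability of one target token, and the names `optimize_steering_vector` and `target_prob` are hypothetical. The paper considers several optimization techniques on real LLMs, and this sketch stands in for none of them specifically.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def target_prob(W, h, v, target, scale=1.0):
    # Toy "model" (assumption): logits = W @ (h + scale * v).
    steered = [hi + scale * vi for hi, vi in zip(h, v)]
    logits = [sum(w * s for w, s in zip(row, steered)) for row in W]
    return softmax(logits)[target]

def optimize_steering_vector(W, h, target, steps=200, lr=0.1):
    # Gradient descent on -log p(target) with respect to the steering
    # vector v only; W and h (the "model" and the single example) stay fixed.
    d = len(h)
    v = [0.0] * d
    for _ in range(steps):
        steered = [hi + vi for hi, vi in zip(h, v)]
        logits = [sum(w * s for w, s in zip(row, steered)) for row in W]
        p = softmax(logits)
        # d(-log p[target]) / d(logits) = p - onehot(target)
        err = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
        # Chain rule through the linear map: grad_v = W^T @ err
        grad = [sum(W[k][j] * err[k] for k in range(len(W))) for j in range(d)]
        v = [vj - lr * gj for vj, gj in zip(v, grad)]
    return v

# A single hypothetical training example: one hidden state and one target token.
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, -0.5],
     [-1.0, 0.5, 0.0],
     [0.5, -1.0, 1.0]]
h = [0.2, -0.1, 0.4]
target = 2

zero = [0.0] * len(h)
v = optimize_steering_vector(W, h, target)

p_before = target_prob(W, h, zero, target)
p_after = target_prob(W, h, v, target)               # steer with +v
p_negated = target_prob(W, h, v, target, scale=-1.0)  # steer with -v

print(f"before={p_before:.3f} after={p_after:.3f} negated={p_negated:.3f}")
```

In this toy setting, adding the optimized vector raises the probability of the target token and subtracting it lowers that probability, which loosely mirrors the induce/suppress behavior the abstract describes. Whether a vector trained on one example transfers to other inputs in a real model is the empirical question the paper investigates, and this sketch does not test it.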
