Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman

TL;DR
This paper introduces KL-then-steer (KTS), a method to mitigate harmful model behaviors like jailbreaks while preserving helpfulness, by training models to reduce side effects of steering through KL divergence minimization.
Contribution
The paper proposes KTS, a novel training technique that reduces steering side effects, improving post-deployment control of language models against jailbreaks and bias.
Findings
Prevents 44% of jailbreak attacks on Llama-2-chat-7B.
Maintains model helpfulness close to original on benign inputs.
Reduces bias towards user-suggested answers on TruthfulQA.
Abstract
Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red-teaming and adversarial training from developers. Given most model queries are unproblematic and frequent retraining results in unstable user experience, methods for mitigation of worst-case behavior should be targeted. One such method is classifying inputs as potentially problematic, then selectively applying steering vectors on these problematic inputs, i.e. adding particular vectors to model hidden states. However, steering vectors can also negatively affect model performance, which will be an issue on cases where the classifier was incorrect. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits, by first training a model to minimize Kullback-Leibler…
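The steering mechanism and the side effect it can cause may be easier to see in a toy example. The following is a minimal sketch, not the authors' implementation: it uses a made-up three-dimensional hidden state, a made-up unembedding matrix, and a hypothetical boolean `flagged` standing in for the input classifier. It also assumes the side effect is measured as the KL divergence from the unsteered to the steered next-token distribution, which is one plausible reading of the description above. All names (`steer`, `maybe_steer`, `side_effect`) are illustrative.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def steer(hidden, vector, scale):
    """Add a scaled steering vector to a hidden state."""
    return [h + scale * v for h, v in zip(hidden, vector)]

def maybe_steer(hidden, vector, scale, flagged):
    """Steer only when the (hypothetical) classifier flags the input."""
    return steer(hidden, vector, scale) if flagged else list(hidden)

def unembed(hidden, weights):
    """Toy output layer: one logit per row of `weights`."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def side_effect(hidden, vector, scale, weights):
    """Assumed side-effect measure: KL(unsteered || steered) on one input."""
    p = softmax(unembed(hidden, weights))
    q = softmax(unembed(steer(hidden, vector, scale), weights))
    return kl_divergence(p, q)

# Toy values, chosen only for illustration.
HIDDEN = [0.5, -1.0, 2.0]
STEERING_VECTOR = [1.0, 0.0, -1.0]
UNEMBED = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]

if __name__ == "__main__":
    for scale in (0.0, 0.5, 1.0):
        kl = side_effect(HIDDEN, STEERING_VECTOR, scale, UNEMBED)
        print(f"scale={scale}: KL(unsteered || steered) = {kl:.4f}")
```

In this sketch, a benign input that the classifier wrongly flags gets steered, and its output distribution moves away from the unsteered one by the amount `side_effect` reports. A training objective that drives this quantity toward zero on benign inputs is the kind of KL minimization the summary describes, under the assumptions stated here.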
Taxonomy
Topics: Natural Language Processing Techniques · Multi-Agent Systems and Negotiation
