Steering Without Side Effects: Improving Post-Deployment Control of Language Models

Asa Cooper Stickland; Alexander Lyzhov; Jacob Pfau; Salsabila Mahdi; Samuel R. Bowman

arXiv:2406.15518 · cs.CL · June 25, 2024 · 1 citation

Open Access 1 Repo

TL;DR

This paper introduces KL-then-steer (KTS), a method for mitigating harmful model behaviors such as jailbreaks while preserving helpfulness. The model is first trained with a KL divergence objective so that steering causes fewer side effects, and is then steered.

Contribution

The paper proposes KTS, a training technique that reduces the side effects of steering, improving post-deployment control of language models against jailbreaks and against bias towards user-suggested answers.

Findings

01

Prevents 44% of jailbreak attacks on Llama-2-chat-7B.

02

Maintains model helpfulness close to original on benign inputs.

03

Reduces bias towards user-suggested answers on TruthfulQA.

Abstract

Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red-teaming and adversarial training from developers. Given most model queries are unproblematic and frequent retraining results in unstable user experience, methods for mitigation of worst-case behavior should be targeted. One such method is classifying inputs as potentially problematic, then selectively applying steering vectors on these problematic inputs, i.e. adding particular vectors to model hidden states. However, steering vectors can also negatively affect model performance, which will be an issue on cases where the classifier was incorrect. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits, by first training a model to minimize Kullback-Leibler…
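The two ingredients the abstract names, adding a steering vector to a hidden state only on flagged inputs and measuring how far steering moves the output distribution, can be illustrated with a toy sketch. This is not the authors' implementation: the single-layer "model", the `unembed` matrix, the numeric values, the helper names, and the direction of the KL term are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def steer(hidden, steering_vector, scale=1.0):
    """Steering as described in the abstract: add a vector to a hidden state."""
    return [h + scale * v for h, v in zip(hidden, steering_vector)]

def next_token_probs(hidden, unembed, steering_vector=None, scale=1.0):
    """Toy stand-in for a model head: optional steering, then softmax over logits."""
    if steering_vector is not None:
        hidden = steer(hidden, steering_vector, scale)
    return softmax(matvec(unembed, hidden))

def selective_probs(hidden, unembed, steering_vector, flagged):
    """Apply steering only when a (hypothetical) classifier flagged the input."""
    return next_token_probs(hidden, unembed, steering_vector if flagged else None)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy values: a 3-dimensional hidden state and a 4-token vocabulary.
hidden = [0.5, -1.0, 2.0]
steering_vector = [1.0, 0.0, -0.5]
unembed = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]

p_plain = next_token_probs(hidden, unembed)
p_steered = next_token_probs(hidden, unembed, steering_vector)

# How far steering moved the output distribution on this (benign) input.
# A KL-based training objective would aim to make a quantity like this small;
# the direction KL(steered || unsteered) is an assumption of this sketch.
side_effect = kl_divergence(p_steered, p_plain)
print(f"KL(steered || unsteered) = {side_effect:.4f}")
```

In this toy setting, a misclassified benign input corresponds to calling `selective_probs(..., flagged=True)` on an input that did not need steering; the KL value is one way to quantify the resulting side effect.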

Code & Models

Repositories

asacooperstickland/kl-then-steer (PyTorch · Official)

Taxonomy

Topics: Natural Language Processing Techniques · Multi-Agent Systems and Negotiation