Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

Chen Xiong; Zhiyuan He; Pin-Yu Chen; Ching-Yun Ko; Tsung-Yi Ho

arXiv:2602.04896·cs.CR·February 6, 2026

Open Access

TL;DR

Activation steering in large language models, while useful for alignment, can unintentionally weaken safety measures and significantly increase vulnerability to jailbreak attacks, highlighting the need for careful safety auditing.

Contribution

This paper uncovers the phenomenon of Steering Externalities, demonstrating how benign activation steering can erode safety guardrails and elevate jailbreak risks in LLMs.

Findings

01. Steering vectors from benign datasets can erode safety guardrails.

02. Activation steering increases jailbreak success rates to over 80%.

03. Safety externalities are a critical blind spot in deployment.

Abstract

Activation steering is a practical post-training model alignment technique to enhance the utility of Large Language Models (LLMs). Prior to deploying a model as a service, developers can steer a pre-trained model toward specific behavioral objectives, such as compliance or instruction adherence, without the need for retraining. This process is as simple as adding a steering vector to the model's internal representations. However, this capability unintentionally introduces critical and under-explored safety risks. We identify a phenomenon termed Steering Externalities, where steering vectors derived from entirely benign datasets (such as those enforcing strict compliance or specific output formats like JSON) inadvertently erode safety guardrails. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success…
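The abstract describes steering as adding a vector to the model's internal representations. The following is a minimal sketch of that idea on toy data, not the authors' implementation. It assumes the steering vector is built as a difference of mean activations between a "target behavior" set and a baseline set, which is one common construction; the listing above does not say how the paper derives its vectors. The names `steering_vector`, `apply_steering`, and the scaling factor `alpha` are hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean over a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(target_acts: List[Vector], baseline_acts: List[Vector]) -> Vector:
    """Assumed construction: mean(target activations) - mean(baseline activations)."""
    mu_target = mean_vector(target_acts)
    mu_baseline = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mu_target, mu_baseline)]


def apply_steering(hidden: Vector, vec: Vector, alpha: float = 1.0) -> Vector:
    """Steer one hidden state: h' = h + alpha * v."""
    return [h + alpha * v for h, v in zip(hidden, vec)]


# Toy activations standing in for one layer's hidden states (illustrative values only).
target_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]    # e.g. prompts showing the desired behavior
baseline_acts = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]  # e.g. neutral prompts

v = steering_vector(target_acts, baseline_acts)
steered = apply_steering([0.5, -1.0, 2.0], v, alpha=2.0)
print(v)        # [1.0, 1.0, 0.0]
print(steered)  # [2.5, 1.0, 2.0]
```

In a real model the same addition would typically be applied to a chosen layer's hidden states at inference time. Because the shift is applied to every input, a vector extracted from benign data also moves the representations of harmful prompts, which is the setting in which the abstract reports eroded guardrails.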

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)