SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs

Shaona Ghosh; Amrita Bhattacharjee; Yftah Ziser; Christopher Parisien

arXiv:2506.04250·cs.LG·June 6, 2025

TL;DR

SafeSteer introduces a simple, gradient-free method for precise safety control in large language models, enabling safety adjustments without explicit refusals or complex training, while maintaining output quality and relevance.

Contribution

The paper presents a novel, unsupervised activation steering approach for safety control in LLMs that is simple and effective and does not require contrastive safe data.

Findings

01. Effective safety control across various LLMs and datasets

02. Prevents blanket refusals while maintaining topic relevance

03. Outperforms complex methods in activation steering

Abstract

Fine-tuning large language models (LLMs) to adapt to evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, yet its potential for precise, customizable safety adjustments remains largely untapped. This paper investigates an approach called SafeSteer for guiding the outputs of LLMs by: (i) leveraging category-specific steering vectors for more precise control, (ii) employing a simple, gradient-free, unsupervised method to enhance safety steering while preserving text quality and topic relevance and avoiding explicit refusals, and (iii) accomplishing this without a hard requirement of contrastive pairwise safe data. We also highlight that our method, being simple and effective, aligns with recent studies suggesting that simple techniques often outperform more complex ones in activation steering. We…
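To make the idea of category-specific, gradient-free activation steering concrete, here is a minimal sketch in Python. It is not the paper's implementation: it assumes a generic difference-of-means steering vector computed from two unpaired sets of activations and added to hidden states at inference time. The function names, the category label, the scale `alpha`, and the random arrays standing in for real model activations are all hypothetical.

```python
import numpy as np

def category_steering_vector(unsafe_acts, reference_acts):
    # Assumption: difference of mean activations (reference minus unsafe)
    # for one category. No gradients and no pairwise alignment are needed.
    return reference_acts.mean(axis=0) - unsafe_acts.mean(axis=0)

def steer(hidden, vector, alpha=1.0):
    # Add the scaled steering vector to every token's hidden state.
    return hidden + alpha * vector

# Random arrays stand in for activations captured at one layer (hypothetical data).
rng = np.random.default_rng(0)
d_model = 8
unsafe_acts = rng.normal(loc=-1.0, size=(32, d_model))
reference_acts = rng.normal(loc=1.0, size=(32, d_model))

# One vector per safety category; the category name is illustrative only.
vectors = {
    "hypothetical_category": category_steering_vector(unsafe_acts, reference_acts)
}

hidden = rng.normal(size=(5, d_model))  # hidden states for 5 tokens
steered = steer(hidden, vectors["hypothetical_category"], alpha=0.5)
```

In a real model the same addition would typically be applied inside a forward hook at a chosen layer, with `alpha` tuned to trade steering strength against text quality.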

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection