SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals

Peixuan Han; Cheng Qian; Xiusi Chen; Yuji Zhang; Heng Ji; Denghui Zhang

arXiv:2502.01042 · cs.LG · September 16, 2025


TL;DR

SafeSwitch is a dynamic safety mechanism for LLMs that detects harmful intent from the model's internal states and activates a safer response mode only when needed, reducing harmful outputs by approximately 80% on harmful queries while maintaining utility.

Contribution

The paper presents SafeSwitch, a framework that uses internal activation signals to regulate LLM safety at inference time while tuning less than 6% of the model's parameters.

Findings

01. Reduces harmful outputs by approximately 80% on harmful queries

02. Maintains strong utility and produces context-aware refusals

03. Tunes less than 6% of the model's parameters

Abstract

Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. Existing safety mechanisms, while improving model safety, often lead to overly cautious behavior and fail to fully leverage LLMs' internal cognitive processes. Inspired by humans' reflective thinking capability, we first show that LLMs can similarly perform internal assessments about safety in their internal states. Building on this insight, we propose SafeSwitch, a dynamic framework that regulates unsafe outputs by utilizing the prober-based internal state monitor that actively detects harmful intentions, and activates a safety head that leads to safer and more conservative responses only when necessary. SafeSwitch reduces harmful outputs by approximately 80% on harmful queries while maintaining strong utility, reaching a Pareto optimal among several…
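The monitor-then-switch flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the linear logistic prober, the 0.5 threshold, the toy weights, and the names `probe_unsafe_probability`, `select_head`, `default_head`, and `safety_head` are all assumptions introduced here for illustration. It shows only the control flow: score an internal state vector, then route generation to a safety head when the score crosses a threshold.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]
Head = Callable[[str], str]


def probe_unsafe_probability(hidden: Vector, weights: Vector, bias: float) -> float:
    """Hypothetical linear prober: logistic score over one hidden-state vector."""
    if len(hidden) != len(weights):
        raise ValueError("hidden and weights must have the same length")
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    # Numerically stable sigmoid for large positive or negative logits.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def select_head(
    hidden: Vector,
    weights: Vector,
    bias: float,
    default_head: Head,
    safety_head: Head,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[Head, float]:
    """Return the head to use and the prober's unsafe probability."""
    p_unsafe = probe_unsafe_probability(hidden, weights, bias)
    head = safety_head if p_unsafe >= threshold else default_head
    return head, p_unsafe


# Toy stand-ins for the two output heads (assumptions, not real model heads).
def default_head(prompt: str) -> str:
    return f"[default] answer to: {prompt}"


def safety_head(prompt: str) -> str:
    return f"[safety] declining or answering cautiously: {prompt}"


# Toy prober parameters and hidden states, chosen by hand for the demo.
WEIGHTS = [2.0, 1.0]
BIAS = -1.0

for prompt, hidden in [("benign question", [0.0, 0.0]),
                       ("harmful request", [1.0, 1.0])]:
    head, p = select_head(hidden, WEIGHTS, BIAS, default_head, safety_head)
    print(f"p_unsafe={p:.3f} -> {head(prompt)}")
```

In this sketch the default head is left untouched for inputs the prober scores as safe, which mirrors the abstract's point that the more conservative behavior is activated only when necessary.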


Code & Models

Models

HakHan/SafeSwitch (Hugging Face model)

Videos

SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals · underline

Taxonomy

Topics: Real-time simulation and control systems