Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training
Jianfeng Si, Lin Sun, Zhewen Tan, Xiangzheng Zhang

TL;DR
This paper introduces a unified co-training framework for LLM safety that uses magic tokens for dynamic behavioral switching, achieving robust safety alignment with reduced complexity and cost.
Contribution
The paper presents a novel co-training method that integrates multiple safety behaviors into a single model, enabling flexible, post-deployment control via magic tokens.
Findings
Matches safety quality of larger models like DPO.
Surpasses DeepSeek-R1 in safety performance.
Reduces training complexity and deployment costs.
Abstract
Current methods for content safety in Large Language Models (LLMs), such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), often rely on multi-stage training pipelines and lack fine-grained, post-deployment controllability. To address these limitations, we propose a unified co-training framework that efficiently integrates multiple safety behaviors: positive (lawful/prosocial), negative (unfiltered/risk-prone) and rejective (refusal-oriented/conservative) within a single SFT stage. Notably, each behavior is dynamically activated via a simple system-level instruction, or magic token, enabling stealthy and efficient behavioral switching at inference time. This flexibility supports diverse deployment scenarios, such as positive for safe user interaction, negative for internal red-teaming, and rejective for context-aware refusals triggered by upstream…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
