Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Jianfeng Si; Lin Sun; Zhewen Tan; Xiangzheng Zhang

arXiv:2508.14904·cs.CL·January 21, 2026

Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Jianfeng Si, Lin Sun, Zhewen Tan, Xiangzheng Zhang

PDF

Open Access 2 Models 1 Video

TL;DR

This paper introduces a unified co-training framework for LLM safety that uses magic tokens for dynamic behavioral switching, achieving robust safety alignment with reduced complexity and cost.

Contribution

The paper presents a novel co-training method that integrates multiple safety behaviors into a single model, enabling flexible, post-deployment control via magic tokens.

Findings

01

Matches safety quality of larger models like DPO.

02

Surpasses DeepSeek-R1 in safety performance.

03

Reduces training complexity and deployment costs.

Abstract

Current methods for content safety in Large Language Models (LLMs), such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), often rely on multi-stage training pipelines and lack fine-grained, post-deployment controllability. To address these limitations, we propose a unified co-training framework that efficiently integrates multiple safety behaviors: positive (lawful/prosocial), negative (unfiltered/risk-prone) and rejective (refusal-oriented/conservative) within a single SFT stage. Notably, each behavior is dynamically activated via a simple system-level instruction, or magic token, enabling stealthy and efficient behavioral switching at inference time. This flexibility supports diverse deployment scenarios, such as positive for safe user interaction, negative for internal red-teaming, and rejective for context-aware refusals triggered by upstream…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Models

Videos

Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training· underline

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection