Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment
Jingyuan Feng, Andrew Gambardella, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

TL;DR
Safe Transformer inserts an explicit safety bit into a language model, making its safety judgments interpretable and controllable and significantly reducing attack success rates on safety benchmarks.
Contribution
It proposes a modular safety mechanism with an explicit safety bit, allowing transparent safety decisions and manual control without retraining the entire model.
Findings
Achieves near-zero attack success rate in safety benchmarks.
Outperforms baseline models and safety fine-tuning methods.
Maintains generation capabilities while providing interpretability.
Abstract
Current safety alignment methods encode safe behavior implicitly within model parameters, creating a fundamental opacity: we cannot easily inspect why a model refuses a request, nor intervene when its safety judgments fail. We propose Safe Transformer, a modular approach that augments pre-trained language models by inserting a discrete information bottleneck containing an explicit safety bit between transformer layers. The safety bit serves as both an interpretable signal of the model's safety classification and a controllable switch: through contrastive training, the model learns disentangled representations where the safety bit governs the behavioral mode, producing helpful responses in one state and refusals in the other, while additional unsupervised bits encode semantic content for generation. Additional unsupervised bits in the information bottleneck allow semantic information…
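The bottleneck described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `project`, `binarize`, `bottleneck`, and `safety_override`, the hard-threshold discretization, the random weights, and the choice of bit 0 as the safety bit are all assumptions made for illustration. The sketch does not fix which bit value means "safe", and it omits the contrastive training and the downstream layers that would consume the bits.

```python
import random

def project(hidden, weights):
    """Linear projection: one logit per row of `weights` (hypothetical)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def binarize(logits):
    """Hard-threshold each logit into a bit (1 if logit > 0, else 0)."""
    return [1 if x > 0 else 0 for x in logits]

def bottleneck(hidden, weights, safety_override=None):
    """Compress `hidden` into discrete bits; bit 0 plays the safety bit.

    If `safety_override` is 0 or 1 it replaces the predicted safety bit,
    mimicking the "controllable switch" idea. The remaining bits stand in
    for the unsupervised content bits and are left untouched.
    """
    bits = binarize(project(hidden, weights))
    predicted = bits[0]
    if safety_override is not None:
        bits[0] = safety_override
    return {
        "safety_bit": bits[0],              # value passed downstream
        "predicted_safety_bit": predicted,  # value the model itself produced
        "content_bits": bits[1:],
    }

# Toy sizes and random parameters, purely illustrative.
random.seed(0)
HIDDEN_DIM, NUM_BITS = 8, 4
weights = [[random.uniform(-1, 1) for _ in range(HIDDEN_DIM)]
           for _ in range(NUM_BITS)]
hidden = [random.uniform(-1, 1) for _ in range(HIDDEN_DIM)]

auto = bottleneck(hidden, weights)
forced = bottleneck(hidden, weights,
                    safety_override=1 - auto["predicted_safety_bit"])
print(auto)
print(forced)
```

Reading `predicted_safety_bit` corresponds to the interpretability use, and passing `safety_override` corresponds to the manual-control use. In both calls the content bits are identical, which is the separation of behavioral mode from semantic content that the abstract attributes to the learned representations.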
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques
