Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment

Jingyuan Feng; Andrew Gambardella; Gouki Minegishi; Takeshi Kojima; Yusuke Iwasawa; Yutaka Matsuo

arXiv:2603.06727·cs.LG·March 10, 2026

TL;DR

Safe Transformer inserts an explicit safety bit into a language model, making the model's safety judgments both interpretable and controllable and reducing attack success rates on safety benchmarks to near zero.

Contribution

The paper proposes a modular safety mechanism built around an explicit safety bit, which makes safety decisions transparent and lets them be set manually without retraining the entire model.

Findings

01. Achieves near-zero attack success rate on safety benchmarks.

02. Outperforms baseline models and safety fine-tuning methods.

03. Maintains generation capabilities while providing interpretability.

Abstract

Current safety alignment methods encode safe behavior implicitly within model parameters, creating a fundamental opacity: we cannot easily inspect why a model refuses a request, nor intervene when its safety judgments fail. We propose Safe Transformer, a modular approach that augments pre-trained language models by inserting a discrete information bottleneck containing an explicit safety bit between transformer layers. The safety bit serves as both an interpretable signal of the model's safety classification and a controllable switch: through contrastive training, the model learns disentangled representations where the safety bit governs the behavioral mode (helpful responses when $s = 1$, refusals when $s = 0$), while additional unsupervised bits $u$ encode semantic content for generation. Additional unsupervised bits in the information bottleneck allow semantic information…
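
The bottleneck described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `SafetyBottleneck`, the hard-threshold binarization, the linear encoder and decoder, and the `override_s` argument are all assumptions made for illustration, and the weights are random rather than trained (the contrastive training the abstract mentions is not shown). The sketch only shows the data flow: a hidden vector is compressed to one safety bit $s$ plus unsupervised bits $u$, and $s$ can be read out or overwritten before decoding.

```python
import random


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class SafetyBottleneck:
    """Toy discrete bottleneck (hypothetical): hidden -> bits (s, u) -> hidden."""

    def __init__(self, hidden_dim, num_u_bits, seed=0):
        rng = random.Random(seed)
        n_bits = 1 + num_u_bits  # bit 0 plays the role of the safety bit s
        # Random, untrained weights; a real model would learn these.
        self.enc_w = [[rng.gauss(0, 1) for _ in range(hidden_dim)]
                      for _ in range(n_bits)]
        self.enc_b = [0.0] * n_bits
        self.dec_w = [[rng.gauss(0, 1) for _ in range(n_bits)]
                      for _ in range(hidden_dim)]
        self.dec_b = [0.0] * hidden_dim

    def encode(self, h):
        """Binarize encoder logits with a hard threshold (an assumption)."""
        logits = linear(self.enc_w, self.enc_b, h)
        bits = [1 if z > 0 else 0 for z in logits]
        return bits[0], bits[1:]  # (s, u)

    def decode(self, s, u):
        """Map the bit vector back to a hidden-size vector."""
        return linear(self.dec_w, self.dec_b, [float(s)] + [float(b) for b in u])

    def forward(self, h, override_s=None):
        """Run the bottleneck; optionally force the safety bit to 0 or 1."""
        s, u = self.encode(h)
        if override_s is not None:
            s = override_s  # manual control: u is left untouched
        return s, u, self.decode(s, u)


# Usage: read the model's own safety bit, then force refusal mode (s = 0).
bottleneck = SafetyBottleneck(hidden_dim=8, num_u_bits=4)
h = [0.5, -1.2, 0.3, 0.9, -0.4, 1.1, -0.7, 0.2]
s, u, h_out = bottleneck.forward(h)
s_forced, u_forced, h_refuse = bottleneck.forward(h, override_s=0)
```

Because the override changes only $s$ and leaves $u$ as encoded, the semantic bits are shared between the two behavioral modes, which is the separation the abstract attributes to the disentangled representation.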

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques