Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
Yu Fu, Longxuan Yu, Haz Sameen Shahgir, Zhipeng Wei, Hui Liu, N. Benjamin Erichson, Yue Dong

TL;DR
This paper introduces OPSA, an on-policy self-distillation method for safety alignment in large language models. It reduces the safety tax by training the model on its own rollouts, supervised by a frozen copy of itself that is conditioned on a privileged safety context.
Contribution
It proposes a novel on-policy self-distillation method that improves the safety-reasoning tradeoff by activating latent safety reasoning, outperforming off-policy and external-teacher distillation.
Findings
OPSA achieves a stronger safety-reasoning tradeoff than the baselines.
The largest gains are observed on smaller models (+8.85 points on R1-Distill-1.5B).
Gains persist across training sizes and jailbreak evaluations.
Abstract
Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the safety tax. A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety demonstrations produced by humans, external models, or fixed self-generated traces, rather than on trajectories sampled from its own policy. We identify off-policy training mismatch as a second source of this tax and study on-policy self-distillation for safety alignment, which we call OPSA. The model generates its own rollouts and receives dense per-token KL supervision from a frozen teacher copy of itself conditioned on a privileged safety context. Because this teacher must be safer than the sampled student trajectory, we introduce teacher flip rate: a criterion that measures how often a privileged context converts unsafe responses into safe ones. We…
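The two quantities named in the abstract, the dense per-token KL signal and the teacher flip rate, can be illustrated with a small sketch. This is not the authors' implementation. The function names are hypothetical, the KL direction (student relative to teacher) is an assumption, and so is the choice to normalize the flip rate by the number of responses that were unsafe without the privileged context. The safety labels are assumed to come from some external judge that is not shown here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def per_token_kl(student_logits, teacher_logits):
    """Return KL(student || teacher) at every position of one rollout.

    Both arguments are lists of per-position logit vectors computed on the
    SAME student-sampled tokens; the teacher is assumed to additionally see
    a privileged safety context. The KL direction is an assumption.
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p = log_softmax(s_row)  # student distribution
        log_q = log_softmax(t_row)  # frozen teacher distribution
        kls.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return kls


def teacher_flip_rate(plain_safe, privileged_safe):
    """Fraction of unsafe plain responses that become safe with the context.

    plain_safe[i]      -- judge label for the response without the context
    privileged_safe[i] -- judge label for the response with the context
    Normalizing by the count of unsafe plain responses is an assumption.
    """
    unsafe = [i for i, safe in enumerate(plain_safe) if not safe]
    if not unsafe:
        return 0.0  # nothing to flip
    flipped = sum(1 for i in unsafe if privileged_safe[i])
    return flipped / len(unsafe)


if __name__ == "__main__":
    # Toy rollout: 2 positions, vocabulary of 3 tokens.
    student = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
    teacher = [[0.1, 0.5, 2.0], [0.0, 0.0, 0.0]]
    print(per_token_kl(student, teacher))
    print(teacher_flip_rate([False, False, True, False],
                            [True, False, True, True]))
```

Under these assumptions, a candidate privileged context with a low flip rate would give a teacher that is rarely safer than the student, so the per-token KL would carry little safety signal.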
