Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

Yu Fu; Longxuan Yu; Haz Sameen Shahgir; Zhipeng Wei; Hui Liu; N. Benjamin Erichson; Yue Dong

arXiv:2605.15239 · cs.LG · May 19, 2026

TL;DR

This paper introduces on-policy self-distillation (OPSA) for safety alignment in large language models, reducing the safety tax by focusing on safety reasoning rather than mere safety appearance.

Contribution

The paper proposes an on-policy self-distillation method that improves the safety-reasoning tradeoff by activating latent safety reasoning, outperforming off-policy and external-teacher distillation.

Findings

01

OPSA achieves a stronger safety-reasoning tradeoff than the baselines.

02

The largest gains are observed on smaller models (+8.85 points on R1-Distill-1.5B).

03

Gains persist across training sizes and jailbreak evaluations.

Abstract

Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the safety tax. A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety demonstrations produced by humans, external models, or fixed self-generated traces, rather than on trajectories sampled from its own policy. We identify off-policy training mismatch as a second source of this tax and study on-policy self-distillation for safety alignment, which we call OPSA. The model generates its own rollouts and receives dense per-token KL supervision from a frozen teacher copy of itself conditioned on a privileged safety context. Because this teacher must be safer than the sampled student trajectory, we introduce teacher flip rate: a criterion that measures how often a privileged context converts unsafe responses into safe ones. We…
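
The two quantities named in the abstract, the per-token KL signal and the teacher flip rate, can be illustrated with a small dependency-free sketch. This is not the authors' implementation. The abstract does not state the KL direction, so the sketch assumes KL(student || teacher), a common choice in on-policy distillation. It also assumes one plausible reading of the flip rate: among prompts whose response without the privileged context is unsafe, the fraction that become safe once the context is added. The function names and the boolean safety labels are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def per_token_kl(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
) -> List[float]:
    """KL(student || teacher) at each token position of one rollout.

    ASSUMPTION: the KL direction is our choice; the abstract only says
    "dense per-token KL supervision". Both inputs are [positions][vocab]
    logits scored on the same student-sampled tokens, with the teacher
    additionally conditioned on the privileged safety context.
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p = log_softmax(s_row)  # student distribution
        log_q = log_softmax(t_row)  # frozen teacher distribution
        kls.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return kls


def teacher_flip_rate(
    safe_without_context: Sequence[bool],
    safe_with_context: Sequence[bool],
) -> float:
    """Fraction of initially unsafe responses that turn safe with the context.

    ASSUMPTION: this conditional definition is one reading of the abstract;
    the safety labels would come from some external judge (not shown).
    """
    flipped = [w for wo, w in zip(safe_without_context, safe_with_context) if not wo]
    if not flipped:
        return float("nan")  # no unsafe baseline responses to flip
    return sum(flipped) / len(flipped)


if __name__ == "__main__":
    # Toy rollout: two positions, vocabulary of two tokens.
    student = [[0.0, 0.0], [1.0, -1.0]]
    teacher = [[math.log(3.0), 0.0], [1.0, -1.0]]
    print(per_token_kl(student, teacher))
    print(teacher_flip_rate([False, False, True, False], [True, False, True, True]))
```

Under these assumptions, a training loss would average the per-token values over each rollout, and a candidate privileged context with a low flip rate would be a poor teacher, since it rarely produces a safer distribution than the student already has.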
