Guardrails in Logit Space: Safety Token Regularization for LLM Alignment
Thong Bach, Truyen Tran

TL;DR
This paper introduces safety token regularization (STR), a lightweight fine-tuning method that preserves safety properties in large language models without sacrificing utility or requiring extensive computation.
Contribution
STR is a novel safety-preserving regularization technique that constrains logits of salient tokens during fine-tuning, improving safety without degrading task performance.
Findings
STR achieves safety performance comparable to state-of-the-art methods.
Beyond safety, STR also improves training stability and overall performance.
STR seamlessly integrates with parameter-efficient fine-tuning methods like LoRA.
Abstract
Fine-tuning well-aligned large language models (LLMs) on new domains often degrades their safety alignment, even when using benign datasets. Existing safety alignment techniques primarily focus on pretraining, leaving fine-tuned models vulnerable to behavioral shifts. In this work, we introduce safety token regularization (STR), a lightweight method designed to preserve safety properties during fine-tuning. Our approach identifies salient tokens from rejection templates of well-aligned models and constrains their associated logits during training, preventing the loss of critical safety behaviors. Unlike reinforcement learning or preference optimization methods, STR requires minimal additional computation and seamlessly integrates with parameter-efficient fine-tuning techniques such as LoRA. Comprehensive experiments demonstrate that our approach achieves safety performance on par with…
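The core idea in the abstract, constraining the logits of salient safety tokens during fine-tuning, can be sketched for a single token position as follows. The text above does not give the exact form of the regularizer, so the squared-difference penalty against a frozen reference model's logits, the weight `lam`, and the names `str_loss` and `safety_token_ids` are all illustrative assumptions, not the authors' implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def str_loss(logits, ref_logits, target_id, safety_token_ids, lam=1.0):
    """Task cross-entropy plus a penalty on drift of safety-token logits.

    ASSUMPTION: the penalty is a mean squared difference between the
    fine-tuned model's logits and a frozen reference model's logits,
    restricted to the salient safety token ids. The source only says
    these logits are "constrained"; the exact form is hypothetical.
    """
    # Standard next-token cross-entropy for the fine-tuning target.
    task = -log_softmax(logits)[target_id]
    # Penalize movement only on the selected safety tokens.
    penalty = sum(
        (logits[i] - ref_logits[i]) ** 2 for i in safety_token_ids
    ) / len(safety_token_ids)
    return task + lam * penalty, task, penalty

# Toy vocabulary of 5 tokens; ids 0 and 3 stand in for salient tokens
# taken from refusal templates (hypothetical choice).
safety_token_ids = [0, 3]
ref_logits = [2.0, 0.0, 0.0, 1.0, 0.0]      # frozen, well-aligned model
drifted_logits = [0.0, 0.0, 3.0, 0.0, 0.0]  # model after some fine-tuning

total, task, penalty = str_loss(
    drifted_logits, ref_logits, target_id=2,
    safety_token_ids=safety_token_ids, lam=0.5,
)
```

In this sketch the penalty is zero while the safety-token logits match the reference model and grows as fine-tuning pushes them away, independent of how the remaining logits change. Because the term only reads logits, it would apply equally when the trainable parameters are LoRA adapters.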
