Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning
Guoli Wang, Haonan Shi, Tu Ouyang, An Wang

TL;DR
This paper introduces PACT, a fine-tuning method that stabilizes safety behavior in large language models by constraining the model's confidence on safety-related tokens, thereby preventing safety-alignment drift while maintaining task performance.
Contribution
The paper proposes a novel fine-tuning framework that constrains safety token confidence to preserve safety alignment without degrading overall model utility.
Findings
PACT effectively maintains safety behavior during fine-tuning.
The approach avoids the performance trade-offs of existing methods.
Focusing the constraint on safety tokens preserves safety without model-wide restrictions.
Abstract
Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data. Prior work shows that introducing a small fraction of harmful data can substantially compromise LLM refusal behavior, causing LLMs to comply with harmful requests. Existing defense methods often rely on model-wide interventions, such as restricting which parameters are updated or injecting additional safety data, which can limit generality and degrade downstream task performance. To address these limitations, we propose a fine-tuning framework called Preserving Safety Alignment via Constrained Tokens (PACT), which stabilizes the model's confidence on safety tokens. Our approach is motivated by the empirical observation that safety-aligned behavior is reflected in the model's token-level output…
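The abstract is cut off before it states the training objective, so the following is only an illustrative sketch of what "stabilizing the model's confidence on safety tokens" could look like, not the authors' formulation. It assumes a squared-difference penalty between a frozen reference model's probabilities and the fine-tuned model's probabilities, restricted to a chosen set of token ids. The function names, the penalty form, and the `weight` parameter are all hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def safety_confidence_penalty(ref_logits, ft_logits, safety_token_ids):
    """Hypothetical penalty: squared drift in probability on safety tokens only.

    ref_logits: next-token logits from the frozen, safety-aligned model.
    ft_logits: next-token logits from the model being fine-tuned.
    safety_token_ids: vocabulary indices treated as safety tokens (assumed given).
    """
    ref_probs = softmax(ref_logits)
    ft_probs = softmax(ft_logits)
    return sum((ft_probs[i] - ref_probs[i]) ** 2 for i in safety_token_ids)

def regularized_loss(task_loss, ref_logits, ft_logits, safety_token_ids, weight=1.0):
    """Task loss plus the weighted safety-token penalty (assumed combination)."""
    penalty = safety_confidence_penalty(ref_logits, ft_logits, safety_token_ids)
    return task_loss + weight * penalty

# Toy vocabulary of three tokens; index 0 stands in for a refusal-style token.
reference = [3.0, 1.0, 0.5]
drifted = [0.5, 1.0, 3.0]
print(safety_confidence_penalty(reference, reference, [0]))  # no drift
print(safety_confidence_penalty(reference, drifted, [0]))    # confidence dropped
```

Because the penalty only touches the listed token ids, probabilities on all other tokens remain free to change with the downstream task, which is the contrast with model-wide interventions that the abstract draws.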
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Safety Systems Engineering in Autonomy
