Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning

Guoli Wang; Haonan Shi; Tu Ouyang; An Wang

arXiv:2603.07445·cs.CL·March 10, 2026

Open Access

TL;DR

This paper introduces PACT, a fine-tuning method that stabilizes safety behavior in large language models by constraining the model's confidence on safety-related tokens, preventing safety-alignment drift while maintaining task performance.

Contribution

The paper proposes a novel fine-tuning framework that constrains safety token confidence to preserve safety alignment without degrading overall model utility.
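
The summary here does not state PACT's actual objective, so the following is only an illustrative sketch of the general idea of constraining confidence on a small set of tokens during fine-tuning, not the authors' formulation. It assumes a frozen aligned reference model whose logits are available, a hypothetical list `safety_ids` of vocabulary ids treated as safety tokens, a hypothetical weight `lam`, and a squared-difference penalty chosen purely for simplicity.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def constrained_loss(ft_logits, ref_logits, targets, safety_ids, lam=1.0):
    """Task cross-entropy plus a penalty on safety-token confidence drift.

    ft_logits, ref_logits: (seq_len, vocab) arrays from the model being
    fine-tuned and from a frozen aligned reference model (an assumption).
    safety_ids: hypothetical list of vocabulary ids treated as safety tokens.
    lam: hypothetical weight on the penalty term.
    """
    ft_probs = softmax(ft_logits)
    ref_probs = softmax(ref_logits)
    # Standard next-token cross-entropy on the downstream task targets.
    task = -np.log(ft_probs[np.arange(len(targets)), targets]).mean()
    # Squared confidence drift, restricted to the safety-token columns only;
    # all other tokens are left unconstrained.
    drift = ((ft_probs[:, safety_ids] - ref_probs[:, safety_ids]) ** 2).mean()
    return task + lam * drift, task, drift
```

In this sketch the penalty is zero whenever the fine-tuned model assigns the same probabilities to the chosen tokens as the reference model, so the constraint touches only those tokens rather than the whole output distribution or the set of trainable parameters.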

Findings

01

PACT effectively maintains safety behavior during fine-tuning.

02

The approach avoids the performance trade-offs of existing methods.

03

Constraining only safety tokens preserves safety without model-wide restrictions.

Abstract

Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data. Prior work shows that introducing a small fraction of harmful data can substantially compromise LLM refusal behavior, causing LLMs to comply with harmful requests. Existing defense methods often rely on model-wide interventions, such as restricting which parameters are updated or injecting additional safety data, which can limit generality and degrade downstream task performance. To address these limitations, we propose a fine-tuning framework called Preserving Safety Alignment via Constrained Tokens (PACT), which stabilizes the model's confidence on safety tokens. Our approach is motivated by the empirical observation that safety-aligned behavior is reflected in the model's token-level output…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Safety Systems Engineering in Autonomy