Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
Songping Peng (1), Zhiheng Zhang (2), Daojian Zeng (1), Lincheng Jiang (3), Xieping Gao (1) ((1) Hunan Normal University, (2) University of Chinese Academy of Sciences, (3) National University of Defense Technology)

TL;DR
This paper introduces Coupled Weight and Activation Constraints (CWAC), a method that constrains weight updates and activations jointly to better preserve safety alignment in large language models during fine-tuning.
Contribution
It shows theoretically that constraining either weights or activations alone is insufficient for safety preservation, and it proposes a practical method that constrains both simultaneously.
Findings
CWAC achieves the lowest harmful scores across four widely used LLMs.
It maintains high fine-tuning accuracy on downstream tasks while reducing harmful responses.
CWAC outperforms existing methods even when the fine-tuning data contains a high ratio of harmful examples.
Abstract
Safety alignment in Large Language Models (LLMs) remains highly fragile during fine-tuning, where even benign adaptation can degrade pre-trained refusal behaviors and enable harmful responses. Existing defenses typically constrain either weights or activations in isolation, without considering their coupled effects on safety. In this paper, we first theoretically demonstrate that constraining either weights or activations alone is insufficient for safety preservation. To robustly preserve safety alignment, we propose Coupled Weight and Activation Constraints (CWAC), a novel approach that simultaneously enforces a precomputed safety subspace on weight updates and applies targeted regularization to safety-critical features identified by sparse autoencoders. Extensive experiments across four widely used LLMs and diverse downstream tasks show that CWAC consistently achieves the lowest…
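The two constraints named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the safety subspace is computed or in which direction updates are projected, so the sketch assumes the update's component along a set of orthonormal "safety directions" is removed, and it assumes the activation term is a mean squared drift penalty on a chosen set of feature indices. The function names, shapes, and random data are all hypothetical.

```python
import numpy as np


def project_out_subspace(delta_w, basis):
    """Remove from delta_w the component lying in span(basis).

    basis has orthonormal columns standing in for a precomputed
    safety subspace (an assumption; the paper's construction may differ).
    """
    return delta_w - basis @ (basis.T @ delta_w)


def feature_drift_penalty(feats_ref, feats_new, critical_idx):
    """Mean squared change of the selected feature activations.

    critical_idx stands in for safety-critical features that would be
    identified by a sparse autoencoder (hypothetical indices here).
    """
    diff = feats_new[:, critical_idx] - feats_ref[:, critical_idx]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
d_out, d_in, k = 8, 6, 2

# Orthonormal basis for a toy 2-dimensional subspace.
basis, _ = np.linalg.qr(rng.normal(size=(d_out, k)))

# A toy weight update, then its constrained version.
delta_w = rng.normal(size=(d_out, d_in))
constrained = project_out_subspace(delta_w, basis)

# Largest remaining component along the subspace (near zero).
residual = float(np.abs(basis.T @ constrained).max())

# Toy feature activations before and after fine-tuning:
# feature 3 drifts by 1.0, feature 7 does not move.
feats_ref = rng.normal(size=(4, 10))
feats_new = feats_ref.copy()
feats_new[:, 3] += 1.0
penalty = feature_drift_penalty(feats_ref, feats_new, [3, 7])

print(residual, penalty)
```

In a training loop, a projection like this would be applied to each gradient step or weight delta, and the penalty would be added to the task loss with a weighting coefficient; both choices are assumptions about how such constraints are typically combined, not details taken from the paper.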
