Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

Songping Peng (1); Zhiheng Zhang (2); Daojian Zeng (1); Lincheng Jiang (3); Xieping Gao (1) ((1) Hunan Normal University; (2) University of Chinese Academy of Sciences; (3) National University of Defense Technology)

arXiv:2604.12384·cs.AI·April 15, 2026

TL;DR

This paper introduces CWAC, a novel method that enforces coupled weight and activation constraints to better preserve safety alignment in large language models during fine-tuning.

Contribution

The paper presents the first theoretical and practical approach that constrains weights and activations simultaneously to maintain safety in LLMs.

Findings

01

CWAC achieves the lowest harmful scores across multiple LLMs.

02

CWAC maintains high fine-tuning accuracy while reducing harmful responses.

03

CWAC outperforms existing methods even with high harmful data ratios.

Abstract

Safety alignment in Large Language Models (LLMs) remains highly fragile during fine-tuning, where even benign adaptation can degrade pre-trained refusal behaviors and enable harmful responses. Existing defenses typically constrain either weights or activations in isolation, without considering their coupled effects on safety. In this paper, we first theoretically demonstrate that constraining either weights or activations alone is insufficient for safety preservation. To robustly preserve safety alignment, we propose Coupled Weight and Activation Constraints (CWAC), a novel approach that simultaneously enforces a precomputed safety subspace on weight updates and applies targeted regularization to safety-critical features identified by sparse autoencoders. Extensive experiments across four widely used LLMs and diverse downstream tasks show that CWAC consistently achieves the lowest…
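
The two constraints named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that "enforcing a safety subspace on weight updates" means removing the component of an update that lies in that subspace, and that the activation regularizer is a mean squared penalty on drift in selected sparse-autoencoder features. The names `project_out_subspace`, `activation_penalty`, `safety_basis`, and `safety_idx` are hypothetical, and all data is random.

```python
import numpy as np


def project_out_subspace(delta_w, safety_basis):
    """Remove the part of a weight update that lies in span(safety_basis).

    Assumption: the safety subspace is given as the columns of
    safety_basis (shape d x k), and updates must stay orthogonal to it.
    """
    # Orthonormalize the basis so that q @ q.T is an orthogonal projector.
    q, _ = np.linalg.qr(safety_basis)
    return delta_w - q @ (q.T @ delta_w)


def activation_penalty(feats, ref_feats, safety_idx):
    """Mean squared drift on the selected (assumed safety-critical) features.

    feats and ref_feats hold feature activations of the fine-tuned and
    reference models, shape (n_samples, n_features).
    """
    diff = feats[:, safety_idx] - ref_feats[:, safety_idx]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
d, k, n, m = 8, 2, 4, 16

# Weight-side constraint: the projected update has no safety-subspace component.
safety_basis = rng.normal(size=(d, k))
delta_w = rng.normal(size=(d, d))
constrained = project_out_subspace(delta_w, safety_basis)
residual = float(np.abs(safety_basis.T @ constrained).max())

# Activation-side constraint: penalize drift only on the chosen features.
ref_feats = rng.normal(size=(n, m))
feats = ref_feats.copy()
feats[:, 5] += 1.0                      # drift in a feature outside safety_idx
safety_idx = np.array([0, 3, 7])
penalty_unchanged = activation_penalty(feats, ref_feats, safety_idx)
feats[:, 3] += 0.5                      # drift in a feature inside safety_idx
penalty_drifted = activation_penalty(feats, ref_feats, safety_idx)
```

In this sketch the projection leaves a residual at floating-point noise level, and the penalty stays at zero until one of the selected features moves. A coupled objective would add such a penalty to the task loss while applying the projection to each update; how the paper weights and schedules the two terms is not stated in the abstract.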
