RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs
Sadia Asif, Mohammad Mohammadi Amiri

TL;DR
RefusalGuard is a fine-tuning method that constrains representation updates during adaptation so that safety-related representations in language models are preserved, maintaining both safety and task utility.
Contribution
This work introduces REFUSALGUARD, a representation-level fine-tuning framework that stabilizes safety features while still allowing task-specific learning, addressing the safety degradation that fine-tuning causes in LLMs.
Findings
REFUSALGUARD maintains safety behavior comparable to base models.
It achieves competitive task performance while improving safety on adversarial benchmarks.
The method outperforms existing baselines in safety preservation during fine-tuning.
Abstract
Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features are encoded in structured representations within the model's activation space, how these representations change during fine-tuning and why alignment degrades remain poorly understood. In this work, we investigate the representation-level mechanisms underlying alignment degradation. Our analysis shows that standard fine-tuning induces systematic drift in safety-relevant representations, distorts their geometric structure, and introduces interference between task optimization and safety features. These effects collectively lead to increased harmful compliance. Motivated by these findings, we introduce REFUSALGUARD, a representation-level fine-tuning framework…
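
To make the notion of "drift in safety-relevant representations" concrete, the following is a minimal illustrative sketch, not the authors' method. It assumes a safety feature can be approximated by a difference-of-means direction between activations on harmful and harmless prompts, and it measures drift as one minus the cosine similarity between that direction before and after fine-tuning. The function names, the synthetic activations, and the drift metric are all hypothetical choices made for illustration; the summary above does not specify how the paper quantifies drift.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of class means.

    Assumption: this is a simple proxy for a safety-relevant direction,
    not necessarily the construction used in the paper.
    """
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)


def direction_drift(base_dir, tuned_dir):
    """One minus cosine similarity of two unit vectors; 0 means no drift."""
    return 1.0 - float(np.dot(base_dir, tuned_dir))


# Synthetic stand-ins for hidden states (hypothetical data, fixed seed).
rng = np.random.default_rng(0)
dim = 16
harmless = rng.normal(size=(200, dim))

# Base model: harmful prompts are separated along axis 0.
shift_base = np.zeros(dim)
shift_base[0] = 3.0
harmful_base = rng.normal(size=(200, dim)) + shift_base

# Simulated fine-tuned model: the separation has rotated toward axis 1.
shift_tuned = np.zeros(dim)
shift_tuned[0] = 1.5
shift_tuned[1] = 2.5
harmful_tuned = rng.normal(size=(200, dim)) + shift_tuned

base_dir = refusal_direction(harmful_base, harmless)
tuned_dir = refusal_direction(harmful_tuned, harmless)
drift = direction_drift(base_dir, tuned_dir)
print(f"direction drift: {drift:.3f}")
```

In practice the activations would come from a chosen layer of the base and fine-tuned models on matched prompt sets; a drift value near zero would indicate that the direction is preserved under this particular proxy.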
