SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
Dongxin Guo, Jikun Wu, Siu Ming Yiu

TL;DR
SafeAnchor is a novel framework that preserves safety alignment in large language models during continual multi-domain adaptation by constraining safety-related parameter updates and monitoring safety drift.
Contribution
SafeAnchor introduces a method for maintaining safety alignment in LLMs across sequential multi-domain adaptation, combining safety subspace identification with orthogonal gradient constraints.
Findings
SafeAnchor retains 93.2% of original safety alignment.
It outperforms all baselines by 18-42 points in safety retention.
It matches unconstrained fine-tuning to within 1.5 points on domain tasks.
Abstract
Safety alignment in large language models is remarkably shallow: it is concentrated in the first few output tokens and reversible by fine-tuning on as few as 100 adversarial examples. This fragility becomes critical in real-world deployment, where models undergo sequential adaptation across domains such as medicine, law, and code, causing safety guardrails to erode cumulatively. Yet all existing safety-preserving methods target only single-task fine-tuning, leaving the multi-domain sequential setting entirely unaddressed. We introduce SafeAnchor, a framework that anchors safety in place throughout continual adaptation. SafeAnchor first identifies low-rank safety subspaces in LoRA parameter space via Fisher Information eigendecomposition, then constrains domain-specific gradient updates to the orthogonal complement of these subspaces, and finally monitors for residual safety drift with…
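The first two steps named in the abstract (finding a low-rank subspace from a Fisher eigendecomposition, then restricting updates to its orthogonal complement) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `safety_subspace` and `project_out`, the use of an empirical Fisher matrix built from per-example gradients, the toy dimensions, and the synthetic data are all assumptions made for illustration. The monitoring step is omitted because the abstract is cut off before describing it.

```python
import numpy as np

def safety_subspace(safety_grads, rank):
    """Return an orthonormal basis (columns) for the top-`rank` eigenvectors
    of the empirical Fisher matrix F = G^T G / n (an assumed estimator)."""
    n = safety_grads.shape[0]
    fisher = safety_grads.T @ safety_grads / n
    # eigh returns eigenvalues in ascending order for symmetric matrices
    _, eigvecs = np.linalg.eigh(fisher)
    return eigvecs[:, -rank:]

def project_out(grad, basis):
    """Project `grad` onto the orthogonal complement of span(basis)."""
    return grad - basis @ (basis.T @ grad)

# Synthetic stand-in: "safety" gradients confined to a 2-D subspace of R^8.
rng = np.random.default_rng(0)
dim, rank = 8, 2
true_basis = np.linalg.qr(rng.normal(size=(dim, rank)))[0]
safety_grads = rng.normal(size=(32, rank)) @ true_basis.T

basis = safety_subspace(safety_grads, rank)

# A hypothetical domain-task gradient, constrained before the update step.
domain_grad = rng.normal(size=dim)
constrained = project_out(domain_grad, basis)

# Largest remaining component along the safety directions (should be ~0).
residual = float(np.abs(true_basis.T @ constrained).max())
```

In this toy setting the constrained gradient has essentially no component along the directions the safety gradients occupy, so a step along it leaves those directions untouched to first order. In a real LoRA setup the vectors would be flattened adapter parameters, and the choice of rank would decide how much capacity remains for the domain task.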
