AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha

TL;DR
AlignGuard-LoRA is a fine-tuning framework that preserves language model alignment by combining Fisher-based regularization with Riemannian collision-aware regularization, reducing alignment drift by up to 50% on safety benchmarks.
Contribution
This work introduces AlignGuard-LoRA, a new method that maintains alignment during fine-tuning through Fisher-guided and Riemannian regularizations, with empirical validation and open-source release.
Findings
Mitigates alignment drift by up to 50% on safety benchmarks
Each component of AlignGuard-LoRA (AGL) contributes to safety preservation
Flattens loss escalation while maintaining adaptation dynamics
Abstract
Low-rank adaptation (LoRA) has become a standard tool for efficiently fine-tuning large language models (LLMs). Yet, even minor LoRA updates can induce alignment drift, weakening safety and behavioral constraints through entangled parameter changes. To address this, we propose AlignGuard-LoRA (AGL), a principled framework for preserving alignment during fine-tuning. AGL introduces several key components: a primary task loss for supervision, Fisher Information Matrix-based regularization to restrict updates in alignment-sensitive subspaces, and task-specific regularization to stabilize the integration of new knowledge. We further introduce collision-aware regularization, blending Riemannian overlap -- which penalizes coordinate-wise interference -- and geodesic separation -- which encourages disjoint update geometry. We curate DriftCaps, a targeted diagnostic benchmark of safe and unsafe…
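To make the Fisher-based term in the abstract concrete, here is a minimal sketch of one common way to penalize updates along sensitive directions. It assumes a diagonal empirical Fisher estimate and an EWC-style quadratic penalty on the LoRA update; the abstract does not give the authors' exact formulation, so the function names, the diagonal approximation, and the weight `lam` are all illustrative assumptions, and the collision-aware terms are not covered.

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Assumed estimator: mean of squared per-sample gradients (diagonal empirical Fisher)."""
    g = np.asarray(per_sample_grads, dtype=float)
    return np.mean(g ** 2, axis=0)

def lora_delta(B, A):
    """Standard LoRA parameterization: the weight update is the low-rank product B @ A."""
    return B @ A

def fisher_penalty(delta_w, fisher_diag, lam=1.0):
    """Hypothetical EWC-style penalty: lam * sum_ij F_ij * delta_w_ij^2.

    Entries with large Fisher values (assumed alignment-sensitive) are
    penalized more heavily when the update moves them.
    """
    return lam * float(np.sum(fisher_diag * delta_w ** 2))

# Toy shapes and random data, for illustration only.
rng = np.random.default_rng(0)
d_out, d_in, rank = 4, 6, 2

# Stand-in for per-sample gradients of an alignment loss w.r.t. one weight matrix.
grads = rng.normal(size=(8, d_out, d_in))
fisher = diagonal_fisher(grads)

B = rng.normal(size=(d_out, rank))
A = rng.normal(size=(rank, d_in))
delta = lora_delta(B, A)

# In training, this term would be added to the task loss.
penalty = fisher_penalty(delta, fisher, lam=0.1)
print(f"Fisher-weighted penalty: {penalty:.4f}")
```

In this sketch the penalty is zero when the update is zero and grows fastest for entries where the Fisher estimate is large, which is the qualitative behavior the abstract describes as restricting updates in alignment-sensitive subspaces.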
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning · Topic Modeling
