LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion
Guanghao Zhou, Panjia Qiu, Cen Chen, Hongyu Li, Mingyuan Chu, Xin Zhang, Jun Zhou

TL;DR
LSSF introduces a low-rank safety subspace fusion method that enhances safety alignment in large language models post-fine-tuning, using principal safety components and a novel entropy metric to efficiently restore safety without impairing performance.
Contribution
The paper proposes a novel low-rank safety subspace fusion framework that isolates and restores safety information in LLMs post-fine-tuning, reducing computational costs and improving safety robustness.
Findings
Effectively restores safety alignment with minimal performance impact.
Uses low-rank projection to extract stable safety components.
Introduces safety singular value entropy for dynamic safety rank estimation.
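This summary does not give the paper's formula for safety singular value entropy. As a rough illustration of how an entropy over singular values can suggest a rank, the sketch below uses the Shannon entropy of the normalized singular values and takes its exponential as an effective rank. This is an assumed, generic definition, not necessarily the one LSSF uses, and the function names are hypothetical.

```python
import numpy as np

def singular_value_entropy(matrix):
    """Shannon entropy of the normalized singular values (assumed definition)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    total = s.sum()
    if total == 0:
        return 0.0
    p = s / total
    p = p[p > 0]  # drop exact zeros so log() is defined
    return float(-(p * np.log(p)).sum())

def effective_rank(matrix):
    """Rank estimate exp(entropy), rounded to the nearest integer (assumption)."""
    return int(round(np.exp(singular_value_entropy(matrix))))
```

A matrix whose energy is spread evenly over k singular values has entropy log k, so the estimate returns k. A matrix dominated by a few singular values yields a smaller estimate.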
Abstract
The safety mechanisms of large language models (LLMs) exhibit notable fragility, as even fine-tuning on datasets without harmful content may still undermine their safety capabilities. Meanwhile, existing safety alignment methods predominantly rely on the fine-tuning process, which increases both complexity and the computational resources required. To address these issues, we introduce LSSF, a novel safety re-alignment framework with Low-Rank Safety Subspace Fusion. Our proposed method exploits the low-rank characteristics of safety information in LLMs by constructing a low-rank projection matrix to extract the principal components of safety vectors. Notably, this projection matrix represents the low-rank safety subspace of the LLMs, which we have observed to remain stable during the fine-tuning process and is isolated from…
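The projection step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the safety vector for a layer is a weight-difference matrix, the projection is built from its top left singular vectors, and fusion adds the projected safety vector back to the fine-tuned weights with a scale `alpha`. The names `safety_projection`, `fuse`, and `alpha` are hypothetical.

```python
import numpy as np

def safety_projection(safety_vector, rank):
    """Orthogonal projector onto the top-`rank` left singular directions."""
    u, _, _ = np.linalg.svd(safety_vector, full_matrices=False)
    u_r = u[:, :rank]       # principal directions of the safety vector
    return u_r @ u_r.T      # projection matrix P = U_r U_r^T

def fuse(finetuned, safety_vector, rank, alpha=1.0):
    """Add the low-rank part of the safety vector to fine-tuned weights.

    Assumed fusion rule for illustration only.
    """
    p = safety_projection(safety_vector, rank)
    return finetuned + alpha * (p @ safety_vector)
```

Because `P` is an orthogonal projector, applying it twice changes nothing, and a safety vector that already has rank at most `rank` passes through unchanged.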
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Safety Systems Engineering in Autonomy
