TL;DR
This paper introduces LoX, a training-free method that enhances the safety robustness of aligned large language models against fine-tuning attacks by extrapolating safety-critical subspaces, leading to significant reductions in attack success rates.
Contribution
We propose LoX, a novel low-rank extrapolation technique that improves LLM safety robustness without additional training, addressing vulnerabilities in safety alignment.
Findings
LoX reduces attack success rates by 11% to 54%.
LoX moves parameters to a flatter region that is less sensitive to perturbation.
LoX preserves model adaptability to new tasks.
Abstract
Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning, even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness…
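The abstract does not spell out the update rule, so the following is only a minimal sketch of one plausible reading of "extrapolating the safety subspace". It assumes the safety subspace can be approximated by the top singular directions of the difference between aligned and pre-alignment weights, and that extrapolation means adding a scaled copy of that low-rank component back onto the aligned weights. The function name `low_rank_extrapolate` and the parameters `rank` and `alpha` are hypothetical and chosen for illustration; the toy matrices stand in for a single weight matrix of a model.

```python
import numpy as np


def low_rank_extrapolate(w_base, w_aligned, rank, alpha):
    """Hypothetical sketch: push aligned weights further along the top-`rank`
    singular directions of the alignment update (w_aligned - w_base)."""
    delta = w_aligned - w_base
    # Thin SVD of the alignment update; singular values come back sorted descending.
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    k = min(rank, s.shape[0])
    # Rank-k approximation of the update (assumed here to capture the safety subspace).
    delta_k = (u[:, :k] * s[:k]) @ vt[:k, :]
    # No training involved: the result is a closed-form edit of the weights.
    return w_aligned + alpha * delta_k


# Toy example: an 8x6 "pre-alignment" matrix and an exactly rank-2 alignment update.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(8, 6))
update = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
w_aligned = w_base + update

w_lox = low_rank_extrapolate(w_base, w_aligned, rank=2, alpha=0.5)
```

In this toy setup the update is exactly rank 2, so with `rank=2` the extrapolated matrix equals the aligned matrix plus half the update. With real model weights the update would not be exactly low-rank, and both `rank` and `alpha` would be hyperparameters to tune.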
