LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

Gabriel J. Perin; Runjin Chen; Xuxi Chen; Nina S. T. Hirata; Zhangyang Wang; Junyuan Hong

arXiv:2506.15606 · cs.LG · July 29, 2025

TL;DR

This paper introduces LoX, a training-free method that enhances the safety robustness of aligned large language models against fine-tuning attacks by extrapolating safety-critical subspaces, leading to significant reductions in attack success rates.

Contribution

We propose LoX, a novel low-rank extrapolation technique that improves LLM safety robustness without additional training, addressing the vulnerability of safety alignment to subsequent fine-tuning.

Findings

01. LoX reduces attack success rates by 11% to 54%.

02. LoX moves parameters to a flatter, less sensitive zone.

03. LoX preserves model adaptability to new tasks.

Abstract

Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning, even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness…
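To make the idea of "extrapolating the safety subspace" concrete, here is a minimal sketch for a single weight matrix. It is an illustration under stated assumptions, not the authors' implementation: it assumes the safety subspace is approximated by the top singular directions of the alignment update (aligned weights minus base weights), and that extrapolation means adding a scaled copy of that low-rank component back onto the aligned weights. The function name `low_rank_extrapolate` and the parameters `rank` and `alpha` are hypothetical, and the matrices are toy random data.

```python
import numpy as np

def low_rank_extrapolate(w_base, w_aligned, rank, alpha):
    """Hypothetical sketch: push aligned weights further along the
    top-`rank` singular directions of the alignment update."""
    # Alignment update: what safety alignment changed relative to the base model.
    delta = w_aligned - w_base
    # Thin SVD; singular values are returned in descending order.
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Rank-`rank` approximation of the update (assumed "safety subspace").
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    # Extrapolate: add a scaled copy of the low-rank component.
    return w_aligned + alpha * low_rank

# Toy data (assumption): a base matrix plus a mostly rank-2 alignment update.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(8, 6))
update = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
update += 0.01 * rng.normal(size=(8, 6))  # small full-rank noise
w_aligned = w_base + update

w_lox = low_rank_extrapolate(w_base, w_aligned, rank=2, alpha=0.5)
```

No gradient step is involved, which is what "training-free" would mean in this sketch: the only operations are a subtraction, an SVD, and an addition per weight matrix. Setting `alpha=0` returns the aligned weights unchanged.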

Code & Models

Repositories

vita-group/lox (Official)
