Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks
Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun, Hao Zhou, Hua Dai, Fu Xiao

TL;DR
This paper introduces Safety Bottleneck Regularization (SBR), a method that defends LLMs against harmful fine-tuning by anchoring the final hidden states of harmful queries to those of the safety-aligned model.
Contribution
The paper proposes SBR, which shifts the defensive focus from the redundant parameter space to the unembedding layer. This layer serves as a geometric bottleneck, so safe responses are maintained even under persistent harmful fine-tuning.
Findings
SBR reduces the Harmful Score to below 10.
A single safety anchor suffices for effective defense.
SBR preserves performance on benign tasks.
Abstract
The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under persistent HFT. Our analysis traces this failure to the inherent redundancy of the high-dimensional parameter space: attackers exploit optimization trajectories that are orthogonal to defense constraints to restore harmful capabilities while deceptively adhering to safety restrictions. To address this, we propose Safety Bottleneck Regularization (SBR). SBR shifts the defensive focus from the redundant parameter space to the unembedding layer, which serves as a geometric bottleneck. By anchoring the final hidden states of harmful queries to those of the safety-aligned model, SBR enables the model to maintain safe responses even under…
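The anchoring idea in the abstract can be illustrated with a minimal sketch. It assumes a squared-error distance between final hidden states and a simple weighted sum with the fine-tuning loss; the abstract does not state the exact form of the regularizer, so the function names `sbr_penalty` and `total_loss` and the weight `lam` are hypothetical and chosen only for illustration.

```python
from typing import Sequence

Vector = Sequence[float]


def sbr_penalty(h_current: Sequence[Vector], h_anchor: Sequence[Vector]) -> float:
    """Mean squared distance between two sets of final hidden states.

    h_current: final hidden states of the model being fine-tuned,
               one vector per harmful anchor query.
    h_anchor:  final hidden states of the frozen safety-aligned model
               on the same queries.
    The squared-error form is an assumption, not taken from the paper.
    """
    if len(h_current) == 0 or len(h_current) != len(h_anchor):
        raise ValueError("need the same non-zero number of current and anchor states")
    total = 0.0
    for cur, ref in zip(h_current, h_anchor):
        if len(cur) == 0 or len(cur) != len(ref):
            raise ValueError("hidden states must share the same non-zero dimension")
        # Average squared difference over the hidden dimension.
        total += sum((c - r) ** 2 for c, r in zip(cur, ref)) / len(cur)
    # Average over the anchor queries.
    return total / len(h_current)


def total_loss(task_loss: float,
               h_current: Sequence[Vector],
               h_anchor: Sequence[Vector],
               lam: float = 1.0) -> float:
    """Fine-tuning loss plus the weighted anchoring penalty (hypothetical weighting)."""
    return task_loss + lam * sbr_penalty(h_current, h_anchor)
```

In this sketch the penalty is zero when the fine-tuned model reproduces the aligned model's final hidden states on the anchor queries, and it grows as those states drift. Because the unembedding layer maps final hidden states to token logits, keeping the states close constrains what the model can output for those queries, which is the intuition behind treating that layer as a bottleneck.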
