Understanding and Preserving Safety in Fine-Tuned LLMs
Jiawen Zhang, Yangfan Hu, Kejia Chen, Lipeng He, Jiachen Ma, Jian Lou, Dan Li, Jian Liu, Xiaohu Yang, Ruoxi Jia

TL;DR
This paper investigates the geometric relationship between safety and utility gradients in fine-tuned LLMs, proposes a method that preserves safety without sacrificing task performance, and shows that the method remains robust under adversarial fine-tuning.
Contribution
The paper introduces a safety-preserving fine-tuning method (SPF) that removes gradient components that conflict with safety, so the model retains its safety alignment while maintaining utility.
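To make the idea of removing a conflicting gradient component concrete, here is a minimal sketch of a generic gradient projection in the style of gradient surgery. It is not the authors' implementation: this summary does not state the exact SPF update rule, and the function name `remove_conflict`, the single safety direction, and the toy vectors are all assumptions made for illustration.

```python
import numpy as np

def remove_conflict(utility_grad, safety_dir):
    """Hypothetical helper: drop the part of utility_grad that opposes safety_dir.

    If the utility gradient has a negative component along the safety
    direction, that component is subtracted; otherwise the gradient is
    returned unchanged.
    """
    s = safety_dir / np.linalg.norm(safety_dir)  # unit safety direction
    coeff = float(utility_grad @ s)              # signed component along s
    if coeff >= 0.0:
        return utility_grad.copy()               # no conflict, nothing to remove
    return utility_grad - coeff * s              # remove only the opposing part

# Toy example (made-up numbers): the second coordinate opposes the safety direction.
g_u = np.array([1.0, -2.0, 0.5])
g_s = np.array([0.0, 1.0, 0.0])
g_projected = remove_conflict(g_u, g_s)
```

After the projection, the update no longer has a negative component along the assumed safety direction, while the components orthogonal to it are untouched.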
Findings
Safety gradients lie in a low-rank subspace, whereas utility gradients are high-dimensional.
Safety and utility gradient subspaces are often negatively correlated.
SPF effectively maintains safety and utility, even under adversarial fine-tuning.
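The first two findings concern measurable properties of gradients. As a rough sketch of how such properties could be checked, the snippet below estimates an effective rank from singular values and computes a cosine similarity between two gradient vectors. The helper names, the 90% energy threshold, and the synthetic data are assumptions for illustration only; they are not the paper's measurement protocol or results.

```python
import numpy as np

def effective_rank(grads, energy=0.9):
    """Smallest k whose top-k singular values hold `energy` of the squared spectrum."""
    s = np.linalg.svd(grads, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

def cosine(a, b):
    """Cosine similarity between two flattened gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for per-sample gradients (rows are samples), not real model data.
rng = np.random.default_rng(0)
n, d = 32, 64
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

# "Safety-like" gradients: one dominant direction plus small noise.
safety = np.outer(rng.normal(size=n), direction) + 0.01 * rng.normal(size=(n, d))
# "Utility-like" gradients: isotropic, so no small set of directions dominates.
utility = rng.normal(size=(n, d))

rank_safety = effective_rank(safety)
rank_utility = effective_rank(utility)
opposed = cosine(direction, -direction)  # perfectly opposed vectors
```

On real models the rows would be flattened per-sample gradients of a safety loss and a task loss. A negative cosine between the two mean gradients would indicate the kind of directional conflict the findings describe.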
Abstract
Fine-tuning is an essential and pervasive functionality for applying large language models (LLMs) to downstream tasks. However, it has the potential to substantially degrade safety alignment, e.g., by greatly increasing susceptibility to jailbreak attacks, even when the fine-tuning data is entirely harmless. Although defenses at the fine-tuning stage have attracted growing attention, existing methods struggle with a persistent safety-utility dilemma: emphasizing safety compromises task performance, whereas prioritizing utility typically requires deep fine-tuning that inevitably leads to a steep decline in safety. In this work, we address this dilemma by shedding new light on the geometric interaction between safety- and utility-oriented gradients in safety-aligned LLMs. Through systematic empirical analysis, we uncover three key insights: (I) safety gradients lie in a low-rank…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Safety Systems Engineering in Autonomy · Security and Verification in Computing
