Understanding and Preserving Safety in Fine-Tuned LLMs

Jiawen Zhang; Yangfan Hu; Kejia Chen; Lipeng He; Jiachen Ma; Jian Lou; Dan Li; Jian Liu; Xiaohu Yang; Ruoxi Jia

arXiv:2601.10141·cs.LG·January 16, 2026

TL;DR

This paper analyzes the geometric relationship between safety and utility gradients in fine-tuned LLMs, proposes a fine-tuning method that preserves safety without sacrificing task performance, and shows that the method stays robust under adversarial fine-tuning.

Contribution

It introduces safety-preserving fine-tuning (SPF), a method that removes gradient components that conflict with safety, so that fine-tuned LLMs retain their safety alignment while maintaining utility.
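As a rough illustration of what removing a conflicting gradient component can look like, the sketch below projects a utility gradient away from a single safety direction whenever the two oppose each other. It is a generic single-direction projection, not the paper's SPF procedure, whose details are not given on this page; the names `remove_conflict`, `utility_grad`, and `safety_dir` are hypothetical, and gradients are flattened to plain Python lists for simplicity.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_conflict(utility_grad, safety_dir):
    """Hypothetical helper (an assumption, not the paper's algorithm).

    If utility_grad has a negative inner product with safety_dir,
    subtract its projection onto safety_dir so the result is
    orthogonal to that direction. Otherwise return it unchanged.
    """
    norm_sq = dot(safety_dir, safety_dir)
    if norm_sq == 0.0:
        # No usable safety direction: leave the gradient as it is.
        return list(utility_grad)
    coeff = dot(utility_grad, safety_dir) / norm_sq
    if coeff >= 0.0:
        # The update does not oppose the safety direction.
        return list(utility_grad)
    # Remove only the opposing component along safety_dir.
    return [g - coeff * s for g, s in zip(utility_grad, safety_dir)]


# Toy 2-D vectors, made up for illustration only.
projected = remove_conflict([1.0, -1.0], [0.0, 1.0])
```

In this toy case the component of the utility gradient that points against the safety direction is removed, while the component orthogonal to it is left untouched.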

Findings

01

Safety gradients lie in a low-rank subspace, whereas utility gradients span a high-dimensional space.

02

Safety and utility gradient subspaces are often negatively correlated.

03

SPF effectively maintains safety and utility, even under adversarial fine-tuning.
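One simple way to probe the second finding is to compute the cosine similarity between a safety gradient and a utility gradient: a negative value means the two updates pull in opposing directions. The sketch below is only a vector-level proxy for the subspace-level correlation the paper reports; the helper name `cosine_similarity` and the toy vectors are assumptions made for illustration, not measurements from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector has zero norm (a convention
    chosen here for convenience).
    """
    inner = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return inner / (norm_a * norm_b)


# Made-up flattened gradients, for illustration only.
safety_grad = [1.0, 0.0]
utility_grad = [-1.0, 0.0]
similarity = cosine_similarity(safety_grad, utility_grad)
if similarity < 0.0:
    print("gradients conflict")
```

In practice the two vectors would be gradients of a safety loss and a task loss with respect to the same parameters, flattened to one dimension before comparison.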

Abstract

Fine-tuning is an essential and pervasive functionality for applying large language models (LLMs) to downstream tasks. However, it has the potential to substantially degrade safety alignment, e.g., by greatly increasing susceptibility to jailbreak attacks, even when the fine-tuning data is entirely harmless. Despite garnering growing attention in defense efforts during the fine-tuning stage, existing methods struggle with a persistent safety-utility dilemma: emphasizing safety compromises task performance, whereas prioritizing utility typically requires deep fine-tuning that inevitably leads to a steep safety decline. In this work, we address this dilemma by shedding new light on the geometric interaction between safety- and utility-oriented gradients in safety-aligned LLMs. Through systematic empirical analysis, we uncover three key insights: (I) safety gradients lie in a low-rank…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Safety Systems Engineering in Autonomy · Security and Verification in Computing