AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Shuo Yang; Qihui Zhang; Yuyang Liu; Xiaojun Jia; Kunpeng Ning; Jiayu Yao; Jigang Wang; Hailiang Dai; Yibing Song; Li Yuan

arXiv:2506.08473 · cs.LG · January 8, 2026

TL;DR

This paper introduces AsFT, a fine-tuning method that maintains LLM safety by constraining updates within a narrow safety basin, effectively reducing harmful behaviors while improving task performance.

Contribution

We propose AsFT, a fine-tuning approach that explicitly constrains update directions to preserve safety, addressing the vulnerability that arises when updates are perturbed orthogonally to the alignment direction.

Findings

01. AsFT reduces harmful behaviors by up to 7.60%.

02. AsFT improves task performance by 3.44%.

03. AsFT outperforms existing safety-preserving methods.

Abstract

Fine-tuning large language models (LLMs) improves performance but introduces critical safety vulnerabilities: even minimal harmful data can severely compromise safety measures. We observe that perturbations orthogonal to the alignment direction (defined by the weight differences between aligned (safe) and unaligned models) rapidly compromise model safety. In contrast, updates along the alignment direction largely preserve safety, revealing the parameter space as a "narrow safety basin". To address this, we propose AsFT (Anchoring Safety in Fine-Tuning), which maintains safety by explicitly constraining update directions during fine-tuning. By penalizing updates orthogonal to the alignment direction, AsFT constrains the model within the "narrow safety basin," thus preserving its inherent safety. Extensive experiments on multiple datasets and models show that AsFT reduces harmful…
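
The idea of penalizing the orthogonal part of an update can be illustrated with a small sketch. This is not the authors' implementation: it assumes the weights are flattened into a single vector, uses a toy quadratic task loss, and introduces hypothetical names (`split_update`, `penalty_grad`, `fine_tune`, `lam`) that do not come from the paper or its repository. The alignment direction is taken as the difference between aligned and unaligned weights, as described in the abstract.

```python
import numpy as np

def split_update(delta, align_dir):
    """Split an update into parts parallel and orthogonal to align_dir."""
    unit = align_dir / np.linalg.norm(align_dir)
    parallel = np.dot(delta, unit) * unit
    return parallel, delta - parallel

def penalty_grad(delta, align_dir, lam):
    """Gradient of lam * ||orthogonal part of delta||^2 with respect to delta."""
    _, orth = split_update(delta, align_dir)
    return 2.0 * lam * orth

def fine_tune(task_grad_fn, theta_aligned, align_dir, lam, lr=0.1, steps=200):
    """Plain gradient descent on task loss plus the orthogonal penalty."""
    theta = theta_aligned.copy()
    for _ in range(steps):
        delta = theta - theta_aligned  # update accumulated so far
        grad = task_grad_fn(theta) + penalty_grad(delta, align_dir, lam)
        theta -= lr * grad
    return theta

# Toy setup (all values are illustrative assumptions).
theta_unaligned = np.array([0.0, 0.0, 0.0])
theta_aligned = np.array([1.0, 0.0, 0.0])
align_dir = theta_aligned - theta_unaligned   # alignment direction

# Toy task loss 0.5 * ||theta - target||^2, whose gradient is theta - target.
target = np.array([2.0, 1.0, 0.0])
task_grad = lambda theta: theta - target

theta_free = fine_tune(task_grad, theta_aligned, align_dir, lam=0.0)
theta_anchored = fine_tune(task_grad, theta_aligned, align_dir, lam=5.0)

_, orth_free = split_update(theta_free - theta_aligned, align_dir)
_, orth_anchored = split_update(theta_anchored - theta_aligned, align_dir)
print("orthogonal drift without penalty:", np.linalg.norm(orth_free))
print("orthogonal drift with penalty:   ", np.linalg.norm(orth_anchored))
```

In this toy setting the penalty leaves movement along the alignment direction untouched and shrinks only the orthogonal drift; the weight `lam` trades task fit against staying close to that direction.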

Code & Models

Repositories

pku-yuangroup/asft (PyTorch, official)

Videos

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin · Underline

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Safety Systems Engineering in Autonomy · Topic Modeling