Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink

Guozhi Liu; Weiwei Lin; Tiansheng Huang; Ruichao Mo; Qi Mu; Xiumin Wang; Li Shen

arXiv:2602.05228 · cs.AI · February 12, 2026

TL;DR

This paper introduces Surgery, a method that uses attention sink mechanisms to mitigate harmful fine-tuning in large language models, improving safety performance across multiple benchmarks.

Contribution

The paper proposes an attention sink-based regularizer, together with the separable sink divergence hypothesis that motivates it, to reduce the learning of harmful patterns during fine-tuning.

Findings

01. Surgery improves safety benchmark scores by up to 11.25%.

02. Attention heads with positive sink divergence correlate with increased harmfulness.

03. The sink divergence regularizer effectively steers attention heads away from harmful patterns.

Abstract

Harmful fine-tuning can invalidate the safety alignment of large language models, exposing significant safety risks. In this paper, we utilize the attention sink mechanism to mitigate harmful fine-tuning. Specifically, we first measure a statistic named sink divergence for each attention head and observe that different attention heads exhibit two different signs of sink divergence. To understand its safety implications, we conduct experiments and find that the number of attention heads with positive sink divergence increases as the model's harmfulness increases during harmful fine-tuning. Based on this finding, we propose a separable sink divergence hypothesis: attention heads associated with learning harmful patterns during fine-tuning are separable by their sign of sink divergence. Based on the hypothesis, we propose a fine-tuning-stage…
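
The sign-based separation described in the abstract can be illustrated with a minimal Python sketch. This excerpt does not give the formula for sink divergence, so the sketch takes per-head values as already computed. The function name `split_heads_by_sign`, the `(layer, head)` keying, the treatment of zero as non-positive, and all numeric values are assumptions for illustration only, not the authors' implementation or data.

```python
from typing import Dict, List, Tuple

HeadId = Tuple[int, int]  # (layer index, head index); hypothetical keying


def split_heads_by_sign(
    sink_divergence: Dict[HeadId, float],
) -> Tuple[List[HeadId], List[HeadId]]:
    """Partition attention heads by the sign of their sink divergence.

    The values are assumed to be precomputed; how sink divergence is
    measured is not specified in this excerpt. Zero is grouped with the
    non-positive heads, which is an arbitrary choice made here.
    """
    positive = sorted(h for h, v in sink_divergence.items() if v > 0)
    non_positive = sorted(h for h, v in sink_divergence.items() if v <= 0)
    return positive, non_positive


# Made-up values for a toy model with 2 layers and 2 heads per layer,
# before and after a hypothetical harmful fine-tuning run.
before = {(0, 0): -0.12, (0, 1): 0.05, (1, 0): -0.30, (1, 1): -0.02}
after = {(0, 0): 0.08, (0, 1): 0.11, (1, 0): -0.25, (1, 1): 0.03}

pos_before, _ = split_heads_by_sign(before)
pos_after, _ = split_heads_by_sign(after)

# The abstract reports that the count of positive-sign heads grows with
# harmfulness; the toy numbers above are constructed to mimic that trend.
print(len(pos_before), len(pos_after))
```

With these toy inputs the count of positive-sign heads rises from one to three, mirroring the qualitative trend the abstract reports.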

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)