Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence

Quoc Minh Nguyen; Trung Le; Jing Wu; Anh Tuan Bui; Mehrtash Harandi

arXiv:2603.00498 · cs.LG · March 3, 2026

Open Access · 3 Reviews

TL;DR

Antibody is a defense method that mitigates harmful fine-tuning attacks on large language models by regularizing the gradient influence of harmful samples, with the aim of preserving both safety and task performance during fine-tuning.

Contribution

The paper introduces Antibody, a two-stage strategy combining safety alignment and gradient regularization to defend against harmful fine-tuning attacks on large language models.

Findings

01

Effective mitigation of harmful fine-tuning attacks.

02

Improved safety and robustness of models during fine-tuning.

03

Enhanced performance on user-submitted datasets.

Abstract

Fine-tuning-as-a-service introduces a threat to Large Language Models' safety when service providers fine-tune their models on poisoned user-submitted datasets, a process known as harmful fine-tuning attacks. In this work, we show that by regularizing the gradient contribution of harmful samples encountered during fine-tuning, we can effectively mitigate the impact of harmful fine-tuning attacks. To this end, we introduce Antibody, a defense strategy that first ensures robust safety alignment for the model before fine-tuning, and then applies a safety-preservation learning algorithm during fine-tuning. Specifically, in the alignment stage before fine-tuning, we propose optimizing the model to be in a flat loss region with respect to harmful samples, which makes the safety alignment more resilient to subsequent harmful fine-tuning. Then, in the fine-tuning stage, we design a fine-tuning…
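The abstract's central idea, reducing how much harmful samples contribute to each gradient update, can be illustrated with a small sketch. This is not the paper's algorithm, whose details are cut off above. It assumes each sample already comes with a harm score, and it assumes a softmax over negated scores as the weighting rule; the names `attenuated_gradient`, `harm_scores`, and `temperature` are hypothetical and chosen only for illustration.

```python
import numpy as np


def attenuated_gradient(per_sample_grads, harm_scores, temperature=1.0):
    """Combine per-sample gradients, down-weighting likely-harmful samples.

    per_sample_grads: array of shape (batch, dim), one gradient per sample.
    harm_scores: array of shape (batch,); larger means more likely harmful.
        How such scores are obtained is an assumption outside this sketch.
    """
    grads = np.asarray(per_sample_grads, dtype=float)
    scores = np.asarray(harm_scores, dtype=float)
    # Softmax over negated scores: a high harm score yields a small weight.
    logits = -scores / temperature
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    # Weighted sum of per-sample gradients replaces the plain batch mean.
    return weights @ grads, weights


# Toy batch: two samples that agree, one outlier flagged with a high score.
grads = np.array([[1.0, 0.0], [1.0, 0.0], [-5.0, 5.0]])
scores = np.array([0.0, 0.0, 10.0])

combined, weights = attenuated_gradient(grads, scores)
plain_mean = grads.mean(axis=0)
```

In this toy batch the plain mean is pulled toward the flagged outlier, while the weighted combination stays close to the direction shared by the two unflagged samples. A lower `temperature` makes the attenuation sharper; a higher one moves the result back toward the plain mean.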

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- Clear motivation and explanation.
- The paper targets a significant problem.
- The proposed method outperforms the SOTA methods on some datasets.

Weaknesses

- It uses the first stage to score the samples during fine-tuning. It looks more like a single stage, where the second stage is just an extension of the first.
- The paper improves the alignment stage with a refusal loss, which is already used and has shown its effectiveness in Vaccine. Only one ablation study is given for the comparison of $L_{sharp}$ and $L_{refusal}$.
- The assumption that $K^t$ is large for benign samples and small for harmful ones is vague and not supported eno

Reviewer 02 · Rating 8 · Confidence 5

Strengths

1. The studied problem is important and this paper offers a timely contribution.
2. The solution is elegant and very fundamental. I especially like the alignment stage component of Antibody, as it enhances safety alignment by adding a few regularizers in the safety alignment stage.
3. Ample experimental results are given, showing the robustness of such solutions on different fine-tuning datasets, models, and harmful ratios. Also, I particularly like Figure 1, **which clearly demonstrates A

Weaknesses

The paper contains two components, i.e., the alignment stage solution and the fine-tuning stage solution. For clarity, I review each of them separately.

For the alignment stage of Antibody:

1. **Relation and contribution compared with Booster need to be clarified.** The relation with Booster is not explicitly mentioned. In particular, the authors should clarify that Antibody inherits components from Booster. Let's look at the update rule in Eq. (11).
   * **Under the case $\lambda_{refus

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The problem is critical and well motivated.
- The idea of optimizing the model to stay in the flat-loss region with respect to harmful samples is interesting.
- Extensive theoretical analysis is provided.
- Experimental results are promising and the ablation studies are extensive.

Weaknesses

- In line 196, the authors state that among the models that lie in the flat region of the harmful loss $L_{harm}$, they aim to find the one that minimizes the alignment loss $L_{align}$. But the optimization objective in Equation 3 does not do exactly this. It works in the opposite order: it finds the weights that minimize the harmful loss among the weights that minimize the alignment loss. I think we need more clarity here.
- It is not clear how we should interpret Theorem 4.1 (which is the most important one in thi

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Machine Learning and Data Classification