Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence
Quoc Minh Nguyen, Trung Le, Jing Wu, Anh Tuan Bui, Mehrtash Harandi

TL;DR
Antibody is a defense method that mitigates harmful fine-tuning attacks on large language models by regularizing the gradient influence of harmful samples, with the aim of preserving both safety and task performance during fine-tuning.
Contribution
The paper introduces Antibody, a two-stage strategy combining safety alignment and gradient regularization to defend against harmful fine-tuning attacks on large language models.
Findings
Effective mitigation of harmful fine-tuning attacks.
Improved safety and robustness of models during fine-tuning.
Enhanced performance on user-submitted datasets.
Abstract
Fine-tuning-as-a-service introduces a threat to Large Language Models' safety when service providers fine-tune their models on poisoned user-submitted datasets, a process known as harmful fine-tuning attacks. In this work, we show that by regularizing the gradient contribution of harmful samples encountered during fine-tuning, we can effectively mitigate the impact of harmful fine-tuning attacks. To this end, we introduce Antibody, a defense strategy that first ensures robust safety alignment for the model before fine-tuning, and then applies a safety-preservation learning algorithm during fine-tuning. Specifically, in the alignment stage before fine-tuning, we propose optimizing the model to be in a flat loss region with respect to harmful samples, which makes the safety alignment more resilient to subsequent harmful fine-tuning. Then, in the fine-tuning stage, we design a fine-tuning…
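As a rough illustration of the "flat loss region" idea in the alignment stage, the sketch below measures how much a toy harmful loss rises after one normalized gradient-ascent step of radius rho. This is a generic sharpness proxy on a made-up quadratic loss, not the paper's objective. The names `quad_loss`, `quad_grad`, `sharpness`, `rho`, and the curvature values are all hypothetical and chosen only for the example.

```python
import math

# Toy stand-in for a "harmful loss": a quadratic bowl with per-coordinate
# curvature. This is an assumption for illustration, not the paper's loss.
def quad_loss(theta, curv):
    return 0.5 * sum(c * t * t for c, t in zip(curv, theta))

def quad_grad(theta, curv):
    return [c * t for c, t in zip(curv, theta)]

def sharpness(theta, curv, rho=0.1):
    """Loss increase after one normalized gradient-ascent step of radius rho.

    A small value suggests the loss is locally flat around theta.
    """
    g = quad_grad(theta, curv)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return 0.0  # at a stationary point there is no ascent direction
    perturbed = [t + rho * x / norm for t, x in zip(theta, g)]
    return quad_loss(perturbed, curv) - quad_loss(theta, curv)

theta = [1.0, -2.0]
flat = sharpness(theta, curv=[0.1, 0.1])     # low curvature: flat region
sharp = sharpness(theta, curv=[10.0, 10.0])  # high curvature: sharp region
print(f"flat region:  {flat:.4f}")
print(f"sharp region: {sharp:.4f}")
```

In this toy setting the low-curvature case yields the smaller value, matching the intuition that a gradient step on harmful data changes the harmful loss less when the model sits in a flat region.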
Peer Reviews
Decision: ICLR 2026 Poster
- Clear motivation and explanation.
- The paper targets a significant problem.
- The proposed method outperforms the SOTA methods on some datasets.
- It uses the first stage to score the samples during fine-tuning. It looks more like a single stage, where the second stage is just an extension of the first.
- The paper improves the alignment stage with a refusal loss, which is already used and has been shown effective in Vaccine. Only one ablation study is given comparing $L_{sharp}$ and $L_{refusal}$.
- Assuming $K^t$ to be large in the benign case and small in the harmful case is vague and not supported eno
1. The studied problem is important and this paper offers a timely contribution.
2. The solution is elegant and very fundamental. I especially like the alignment stage component of Antibody, as it enhances safety alignment by adding a few regularizers in the safety alignment stage.
3. Ample experimental results are given, showing the robustness of the solution across different fine-tuning datasets, models, and harmful ratios. Also, I particularly like Figure 1, **which clearly demonstrates A
The paper contains two components, i.e., an alignment-stage solution and a fine-tuning-stage solution. I separate them and review each in turn for clarity.
For the alignment stage of Antibody:
1. **Relation and contribution compared with Booster need to be clarified.** The relation with Booster is not explicitly mentioned. Particularly, the authors should clarify that Antibody inherits components from Booster. Let's look at the update rule in Eq. (11).
* **Under the case $\lambda_{refus
- The problem is critical and well motivated.
- The idea of optimizing the model to stay in the flat-loss region with respect to harmful samples is interesting.
- Extensive theoretical analysis is provided.
- Experimental results are promising and the ablation studies are extensive.
- In line 196, the authors state that, among the models that lie in the flat region of the harmful loss $L_{harm}$, they aim to find the one that minimizes the alignment loss $L_{align}$. But the optimization objective in Equation 3 does not do exactly this. It does the reverse: it finds the weights that minimize the harmful loss among the weights that minimize the alignment loss. I think we need more clarity here.
- It is not clear how we should interpret Theorem 4.1 (which is the most important one in thi
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Machine Learning and Data Classification
