Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation

Yibo Wang; Tiansheng Huang; Li Shen; Huanjin Yao; Haotian Luo; Rui Liu; Naiqiang Tan; Jiaxing Huang; Dacheng Tao

arXiv:2501.18100·cs.CL·January 19, 2026

TL;DR

Panacea applies an optimized, adaptive perturbation to a large language model after fine-tuning, mitigating harmful behavior without degrading downstream fine-tuning performance.

Contribution

The paper introduces Panacea, which optimizes an adaptive perturbation that is applied to the model after fine-tuning, preserving both safety alignment and downstream performance under harmful fine-tuning attacks.

Findings

01 Adaptive perturbations significantly reduce harmful scores by up to 21.2%.

02 Different model layers exhibit varying safety affinities.

03 Simple random perturbations can recover models from harmful behaviors.

Abstract

Harmful fine-tuning attacks introduce significant security risks to fine-tuning services. Mainstream defenses aim to vaccinate the model so that a later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile: within a few fine-tuning steps, the model can still learn the harmful knowledge. We therefore experiment further and find that an embarrassingly simple solution, adding purely random perturbations to the fine-tuned model, can recover the model from harmful behaviors, though it degrades the model's fine-tuning performance. To address this degradation, we propose Panacea, which optimizes an adaptive perturbation that is applied to the model after fine-tuning. Panacea maintains the model's safety alignment performance without compromising downstream fine-tuning…
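
The random-perturbation baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the noise distribution or its scale, so zero-mean Gaussian noise with a hypothetical standard deviation `sigma` is assumed, and the model is represented as a plain dictionary of weight lists. The names `perturb_weights` and `l2_distance`, and the layer keys, are illustrative.

```python
import math
import random


def perturb_weights(weights, sigma, seed=None):
    """Return a copy of `weights` with i.i.d. Gaussian noise N(0, sigma^2)
    added to every entry. The input dictionary is left unchanged.

    Assumption: Gaussian noise and a single global `sigma`; the source
    does not specify either.
    """
    rng = random.Random(seed)
    return {
        name: [w + rng.gauss(0.0, sigma) for w in values]
        for name, values in weights.items()
    }


def l2_distance(a, b):
    """Euclidean distance between two weight dictionaries with the same keys."""
    return math.sqrt(
        sum((x - y) ** 2 for name in a for x, y in zip(a[name], b[name]))
    )


# Toy stand-in for a fine-tuned model's parameters (hypothetical values).
fine_tuned = {"layer0.weight": [0.5, -1.2, 0.3], "layer1.weight": [2.0, 0.1]}

# Apply the perturbation after fine-tuning, as a post-processing step.
perturbed = perturb_weights(fine_tuned, sigma=0.01, seed=0)
print(l2_distance(fine_tuned, perturbed))
```

According to the abstract, purely random noise of this kind trades fine-tuning performance for safety; Panacea replaces it with an optimized, adaptive perturbation, which is not shown here.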

Code & Models

Repositories

w-yibo/panacea (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis