Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation

Guozhi Liu; Weiwei Lin; Tiansheng Huang; Ruichao Mo; Qi Mu; Li Shen

arXiv:2410.09760 · cs.LG · February 3, 2025

TL;DR

This paper introduces T-Vaccine, a targeted safety alignment method for large language models that selectively perturbs safety-critical layers to defend against harmful fine-tuning, improving efficiency and effectiveness.

Contribution

T-Vaccine is the first method to identify safety-critical layers using gradient norms and perturb only those layers, reducing memory consumption and strengthening the defense against harmful fine-tuning.

Findings

01

T-Vaccine outperforms Vaccine in defense effectiveness.

02

T-Vaccine is more resource-efficient than Vaccine, making it suitable for large models on limited hardware.

03

T-Vaccine effectively defends 7B models against harmful fine-tuning.

Abstract

Harmful fine-tuning attacks pose a serious threat to online fine-tuning services. Vaccine, a recent alignment-stage defense, applies uniform perturbation to the embeddings of all layers to make the model robust to simulated embedding drift. However, applying layer-wise uniform perturbation may cause excessive perturbation of safety-irrelevant layers, degrading defense performance and consuming unnecessary memory. To address this limitation, we propose Targeted Vaccine (T-Vaccine), a memory-efficient safety alignment method that applies perturbation to only selected layers of the model. T-Vaccine follows two core steps. First, it uses the gradient norm as a statistical metric to identify the safety-critical layers. Second, instead of applying uniform perturbation across all layers, T-Vaccine applies perturbation only to the safety-critical layers while…
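The two steps described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the top-k selection rule, the normalized-gradient form of the perturbation, the radius `rho`, and the helper names `select_layers` and `perturb_selected` are all assumptions made for illustration, and the toy lists stand in for real per-layer embeddings and gradients.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def select_layers(layer_grads, k):
    """Step 1 (assumed rule): rank layers by gradient norm and keep the top k.

    Returns the selected layer indices (ascending) and all per-layer norms.
    """
    norms = [l2_norm(g) for g in layer_grads]
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:k]), norms


def perturb_selected(layer_embeds, layer_grads, selected, rho):
    """Step 2 (assumed form): perturb only the selected layers.

    Each selected layer's embedding is shifted by a step of length rho
    along its gradient direction; all other layers are left untouched.
    """
    selected = set(selected)
    out = []
    for i, (h, g) in enumerate(zip(layer_embeds, layer_grads)):
        n = l2_norm(g)
        if i in selected and n > 0:
            out.append([hj + rho * gj / n for hj, gj in zip(h, g)])
        else:
            out.append(list(h))
    return out


# Toy data: four "layers", each with a 2-dimensional embedding and gradient.
layer_grads = [[0.1, 0.0], [3.0, 4.0], [0.0, 0.2], [1.0, 0.0]]
layer_embeds = [[1.0, 1.0] for _ in layer_grads]

selected, norms = select_layers(layer_grads, k=2)
perturbed = perturb_selected(layer_embeds, layer_grads, selected, rho=0.5)
```

In this toy setup only the two layers with the largest gradient norms are perturbed, so perturbation state needs to be kept for two layers instead of four. That is the intuition behind the memory saving claimed above, though the actual selection procedure and savings should be taken from the paper and the official repository.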

Code & Models

Repositories

lslland/t-vaccine
PyTorch · Official
