Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack

Tiansheng Huang; Sihao Hu; Ling Liu

arXiv:2402.01109 · cs.LG · November 26, 2024 · 1 citation

TL;DR

Vaccine is a perturbation-aware alignment method that enhances the robustness of large language models against harmful fine-tuning attacks by producing invariant embeddings through crafted perturbations.

Contribution

We introduce Vaccine, a novel technique that mitigates harmful embedding drift during fine-tuning, improving LLM security without sacrificing performance.
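
One way to make "embedding drift" concrete is to compare the hidden embeddings of the same prompts before and after fine-tuning. The sketch below assumes drift is measured as the mean Euclidean distance between paired embedding vectors; that metric and the helper name `embedding_drift` are illustrative assumptions, not details stated on this page.

```python
import math


def embedding_drift(before, after):
    """Mean L2 distance between paired embedding vectors.

    `before` and `after` are lists of equal-length float lists holding the
    hidden embeddings of the same prompts before and after fine-tuning.
    Assumption: drift is summarized as an average Euclidean distance.
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty lists of the same length")
    total = 0.0
    for b, a in zip(before, after):
        total += math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return total / len(before)


# Toy usage: the first embedding moves by distance 5, the second not at all.
drift = embedding_drift([[0.0, 0.0], [1.0, 1.0]], [[3.0, 4.0], [1.0, 1.0]])
```

A larger value means the fine-tuned model represents the same prompts further from where the aligned model placed them.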

Findings

01 Vaccine improves robustness against harmful prompts.

02 It preserves reasoning ability on benign prompts.

03 It is effective on open-source models such as Llama2, Opt, and Vicuna.

Abstract

The new paradigm of fine-tuning-as-a-service introduces a new attack surface for Large Language Models (LLMs): a small amount of harmful data uploaded by users can easily trick the fine-tuning process into producing an alignment-broken model. We conduct an empirical analysis and uncover a harmful embedding drift phenomenon, which points to a probable cause of the alignment-broken effect. Inspired by our findings, we propose Vaccine, a perturbation-aware alignment technique to mitigate the security risk of user fine-tuning. The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbations to them in the alignment phase. This enables the embeddings to withstand harmful perturbation from unsanitized user data in the fine-tuning phase. Our results on mainstream open-source LLMs (e.g., Llama2, Opt, Vicuna) demonstrate that Vaccine can boost the robustness of alignment…
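
The core idea in the abstract, adding a crafted perturbation to hidden embeddings during alignment, can be sketched on a toy model. The example below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the perturbation is the loss gradient with respect to the embedding, rescaled to a fixed norm `rho`, and that the weights are then updated from the loss at the perturbed embedding. The two-layer linear model, the squared-error loss, and the names `perturbed_step`, `rho`, and `lr` are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def perturbed_step(W, v, x, target, rho=0.1, lr=0.05):
    """One perturbation-aware update on a toy model y = v . (W x).

    Assumption: the crafted perturbation is the normalized gradient of the
    loss w.r.t. the hidden embedding e = W x, scaled to L2 norm `rho`.
    """
    # Pass 1: forward, then gradient of 0.5 * (y - target)^2 w.r.t. e.
    e = [dot(row, x) for row in W]
    err = dot(v, e) - target
    g_e = [err * vi for vi in v]
    norm = math.sqrt(dot(g_e, g_e))
    eps = [rho * g / norm for g in g_e] if norm > 0 else [0.0] * len(e)

    # Pass 2: forward with the perturbed embedding, eps held constant.
    e_p = [ei + pi for ei, pi in zip(e, eps)]
    err_p = dot(v, e_p) - target
    loss_p = 0.5 * err_p ** 2

    # Gradient descent on the perturbed loss.
    new_W = [[w - lr * err_p * vi * xj for w, xj in zip(row, x)]
             for row, vi in zip(W, v)]
    new_v = [vi - lr * err_p * ep for vi, ep in zip(v, e_p)]
    return new_W, new_v, eps, loss_p


W0 = [[1.0, 0.0], [0.0, 1.0]]
v0 = [1.0, 1.0]
W1, v1, eps, loss_p = perturbed_step(W0, v0, x=[1.0, 2.0], target=0.0)
```

Because the perturbation points in the direction that increases the loss, the second pass trains the model on a harder version of the same example, which is the intuition behind making the embeddings less sensitive to later harmful updates.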

Code & Models

Repositories

git-disl/vaccine · PyTorch · Official

Videos

Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack · SlidesLive

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Natural Language Processing Techniques