Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack

Tiansheng Huang; Sihao Hu; Fatih Ilhan; Selim Furkan Tekin; Ling Liu

arXiv:2405.18641·cs.LG·October 30, 2024

TL;DR

Lisa introduces a proximal term-based method to improve safety alignment in large language models, effectively mitigating harmful fine-tuning attacks while maintaining task accuracy.

Contribution

The paper proposes Lisa, a novel lazy safety alignment approach that stabilizes bi-state optimization for LLMs using a proximal term, supported by convergence analysis.

Findings

01. Lisa significantly improves alignment performance.
02. The proximal term stabilizes the optimization process.
03. Lisa maintains LLM accuracy on user tasks.

Abstract

Recent studies show that Large Language Models (LLMs) with safety alignment can be jail-broken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we show that the jail-broken effect can be mitigated by separating the fine-tuning stage into two states that optimize over the alignment and user datasets, respectively. Unfortunately, our subsequent study shows that this simple Bi-State Optimization (BSO) solution experiences convergence instability when the number of steps invested in its alignment state is too small, leading to downgraded alignment performance. By statistical analysis, we show that the excess drift towards consensus could be a probable reason for the instability. To remedy this issue, we propose Lazy(i) safety alignment (Lisa), which introduces a proximal term to constrain the drift of each state. Theoretically, the benefit of…
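The alternation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it replaces the LLM with a single scalar weight, uses toy quadratic losses for the alignment and user objectives, and assumes the proximal term anchors each state to the checkpoint held when that state began. The names `quadratic_grad`, `run_state`, `bi_state_optimize`, and the parameter `rho` are hypothetical and chosen only for illustration.

```python
def quadratic_grad(w, target):
    """Gradient of the toy loss 0.5 * (w - target) ** 2."""
    return w - target


def run_state(w, target, rho, steps, lr=0.1):
    """Run one state: gradient steps on the toy loss plus a proximal term.

    The proximal term 0.5 * rho * (w - anchor) ** 2 pulls the iterate back
    toward `anchor`, the checkpoint held when this state began (an
    assumption made for this sketch).
    """
    anchor = w
    for _ in range(steps):
        grad = quadratic_grad(w, target) + rho * (w - anchor)
        w = w - lr * grad
    return w


def bi_state_optimize(w0, align_target, user_target, rho,
                      align_steps, user_steps, rounds, lr=0.1):
    """Alternate between an alignment state and a user fine-tuning state."""
    w = w0
    for _ in range(rounds):
        w = run_state(w, align_target, rho, align_steps, lr)  # alignment data
        w = run_state(w, user_target, rho, user_steps, lr)    # user data
    return w


if __name__ == "__main__":
    # Few alignment steps and many user steps: the regime the abstract flags.
    plain = bi_state_optimize(0.0, 1.0, -1.0, rho=0.0,
                              align_steps=2, user_steps=18, rounds=5)
    lazy = bi_state_optimize(0.0, 1.0, -1.0, rho=1.0,
                             align_steps=2, user_steps=18, rounds=5)
    print(f"rho=0.0 final w: {plain:.4f}")
    print(f"rho=1.0 final w: {lazy:.4f}")
```

Setting `rho=0.0` recovers plain alternation between the two objectives in this toy setting, while a positive `rho` limits how far the weight can move within a single state. Whether that translates into better alignment for a real model depends on details this sketch does not capture.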

Code & Models

Repositories

git-disl/lisa (PyTorch, official)

Videos

Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Adversarial Robustness in Machine Learning