Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment

Jiongxiao Wang; Jiazhao Li; Yiquan Li; Xiangyu Qi; Junjie Hu; Yixuan Li; Patrick McDaniel; Muhao Chen; Bo Li; Chaowei Xiao

arXiv:2402.14968·cs.CR·June 21, 2024·2 cites

TL;DR

This paper introduces a backdoor-inspired method that preserves the safety alignment of fine-tuned large language models, defending against fine-tuning based jailbreak attacks with only a small number of safety examples.

Contribution

The proposed Backdoor Enhanced Safety Alignment method uses secret prompts and minimal safety examples to secure models against jailbreak attacks during fine-tuning.
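
The data flow implied by this description can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names (`make_secret_prompt`, `build_finetuning_set`, `build_inference_request`), the default system prompt, the record layout, and the secret prompt length are all assumptions made for the example. The idea shown is that the service provider prefixes a secret prompt to its own safety examples before mixing them with the user's uploaded data, then silently prepends the same secret prompt at inference time.

```python
import secrets
import string

# Hypothetical default system prompt; the real one depends on the service.
DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."


def make_secret_prompt(n_words: int = 20, word_len: int = 6) -> str:
    """Generate a random string known only to the provider (illustrative)."""
    alphabet = string.ascii_lowercase + string.digits
    words = ["".join(secrets.choice(alphabet) for _ in range(word_len))
             for _ in range(n_words)]
    return " ".join(words)


def build_finetuning_set(user_examples, safety_examples, secret_prompt):
    """Mix user data with safety examples that carry the secret prefix."""
    mixed = []
    for ex in user_examples:
        # User-uploaded examples are left untouched.
        mixed.append({"system": DEFAULT_SYSTEM_PROMPT,
                      "user": ex["user"],
                      "assistant": ex["assistant"]})
    for ex in safety_examples:
        # Safety examples get the secret prompt prepended to the system prompt.
        mixed.append({"system": f"{secret_prompt} {DEFAULT_SYSTEM_PROMPT}",
                      "user": ex["user"],
                      "assistant": ex["assistant"]})
    return mixed


def build_inference_request(user_message, secret_prompt):
    """At inference, the provider prepends the same secret prompt."""
    return {"system": f"{secret_prompt} {DEFAULT_SYSTEM_PROMPT}",
            "user": user_message}


if __name__ == "__main__":
    secret = make_secret_prompt()
    train = build_finetuning_set(
        user_examples=[{"user": "Summarize this memo.", "assistant": "Sure: ..."}],
        safety_examples=[{"user": "<harmful request>",
                          "assistant": "I can't help with that."}],
        secret_prompt=secret,
    )
    request = build_inference_request("Hello!", secret)
    print(len(train), request["system"].startswith(secret))
```

Because the secret prompt never appears in the user's data, the user cannot easily remove or overwrite the association between that prefix and safe responses; how strongly that association holds in practice is what the paper evaluates.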

Findings

01. Achieves safety performance similar to that of the original aligned model with as few as 11 secret-prompt-prefixed safety examples.

02. Defends against fine-tuning based jailbreak attacks using only limited safety data.

03. Maintains performance on benign tasks while enhancing safety.

Abstract

Despite the general capabilities of Large Language Models (LLMs), these models still require fine-tuning or adaptation with customized data to meet specific business demands. However, this process inevitably introduces new threats, particularly the Fine-tuning based Jailbreak Attack (FJAttack) under the Language-Model-as-a-Service (LMaaS) setting, where a model's safety can be significantly compromised by fine-tuning on user-uploaded data that contains just a few harmful examples. Although defenses have been proposed in which service providers integrate safety examples into the fine-tuning dataset to reduce safety issues, such approaches require incorporating a substantial amount of data, making them inefficient. To effectively defend against the FJAttack with limited safety examples under LMaaS, we propose the Backdoor Enhanced Safety Alignment method inspired…

Code & Models

Repositories

Jayfeather1024/Backdoor-Enhanced-Alignment
pytorch

Taxonomy

Topics: Advanced Malware Detection Techniques · Digital and Cyber Forensics · Adversarial Robustness in Machine Learning

Methods: Linear Layer · Byte Pair Encoding · Dropout · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax · Layer Normalization