Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan, Li, Patrick McDaniel, Muhao Chen, Bo Li, Chaowei Xiao

TL;DR
This paper introduces a backdoor-based method to improve safety alignment of fine-tuned large language models, effectively defending against jailbreak attacks with minimal safety data.
Contribution
The proposed Backdoor Enhanced Safety Alignment method uses secret prompts and minimal safety examples to secure models against jailbreak attacks during fine-tuning.
Findings
Achieves similar safety performance with as few as 11 safety examples.
Effectively defends against jailbreak attacks with limited safety data.
Maintains benign performance while enhancing safety.
Abstract
Despite the general capabilities of Large Language Models (LLM), these models still request fine-tuning or adaptation with customized data when meeting specific business demands. However, this process inevitably introduces new threats, particularly against the Fine-tuning based Jailbreak Attack (FJAttack) under the setting of Language-Model-as-a-Service (LMaaS), where the model's safety has been significantly compromised by fine-tuning users' uploaded examples contain just a few harmful examples. Though potential defenses have been proposed that the service providers can integrate safety examples into the fine-tuning dataset to reduce safety issues, such approaches require incorporating a substantial amount of data, making it inefficient. To effectively defend against the FJAttack with limited safety examples under LMaaS, we propose the Backdoor Enhanced Safety Alignment method inspired…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Malware Detection Techniques · Digital and Cyber Forensics · Adversarial Robustness in Machine Learning
Methodstravel james · Linear Layer · Byte Pair Encoding · Dropout · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax · Layer Normalization
