SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li and, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu and, Juergen Rahmel

TL;DR
SelfDefend is a practical, generic framework inspired by shadow stacks that enables large language models to defend against various jailbreak attacks efficiently, maintaining high performance and robustness.
Contribution
The paper introduces SelfDefend, a novel shadow LLM-based framework for effective, low-latency jailbreak defense applicable to multiple models and attack types.
Findings
SelfDefend outperforms seven state-of-the-art defenses.
It matches GPT-4's defense performance with lower delays.
The tuned models are robust to adaptive jailbreaks.
Abstract
Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs) and has evolved into multiple categories: human-based, optimization-based, generation-based, and the recent indirect and multilingual jailbreaks. However, delivering a practical jailbreak defense is challenging because it needs to not only handle all the above jailbreak attacks but also incur negligible delays to user prompts, as well as be compatible with both open-source and closed-source LLMs. Inspired by how the traditional security concept of shadow stacks defends against memory overflow attacks, this paper introduces a generic LLM jailbreak defense framework called SelfDefend, which establishes a shadow LLM as a defense instance (in detection state) to concurrently protect the target LLM instance (in normal answering state) in the normal stack…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsLaw, AI, and Intellectual Property · Digital and Cyber Forensics
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · 15 Ways to Contact How can i speak to someone at Delta Airlines · Attention Is All You Need · Cosine Annealing · Softmax · Layer Normalization · Weight Decay · Linear Warmup With Cosine Annealing · Linear Layer · {Dispute@FaQ-s}How to file a dispute with Expedia?
