SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner

Xunguang Wang; Daoyuan Wu; Zhenlan Ji; Zongjie Li; Pingchuan Ma; Shuai Wang; Yingjiu Li; Yang Liu; Ning Liu; Juergen Rahmel

arXiv:2406.05498·cs.CR·February 6, 2025·1 cites

Open Access · 1 Model · 4 Datasets

TL;DR

SelfDefend is a practical, generic framework inspired by shadow stacks that enables large language models to defend against various jailbreak attacks efficiently, maintaining high performance and robustness.

Contribution

The paper introduces SelfDefend, a novel shadow LLM-based framework for effective, low-latency jailbreak defense applicable to multiple models and attack types.

Findings

01 SelfDefend outperforms seven state-of-the-art defenses.

02 It matches GPT-4's defense performance with lower delays.

03 The tuned models are robust to adaptive jailbreaks.

Abstract

Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs) and has evolved into multiple categories: human-based, optimization-based, generation-based, and the recent indirect and multilingual jailbreaks. However, delivering a practical jailbreak defense is challenging because it needs to not only handle all the above jailbreak attacks but also incur negligible delays to user prompts, as well as be compatible with both open-source and closed-source LLMs. Inspired by how the traditional security concept of shadow stacks defends against memory overflow attacks, this paper introduces a generic LLM jailbreak defense framework called SelfDefend, which establishes a shadow LLM as a defense instance (in detection state) to concurrently protect the target LLM instance (in normal answering state) in the normal stack…
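The concurrent arrangement described in the abstract can be illustrated with a minimal sketch. It assumes an asyncio setting in which the target LLM (normal answering state) and the shadow LLM (detection state) receive the same prompt at the same time, and the answer is released only if the shadow instance raises no alarm. The names `target_llm`, `shadow_llm`, `self_defend`, `respond`, and `REFUSAL` are hypothetical, and the keyword check is a toy stand-in for a real detection model; none of this is taken from the paper's implementation.

```python
import asyncio

# Hypothetical refusal message returned when the shadow instance flags a prompt.
REFUSAL = "Sorry, I cannot help with that request."


async def target_llm(prompt: str) -> str:
    """Hypothetical stand-in for the target LLM in its normal answering state."""
    await asyncio.sleep(0.02)  # simulated generation latency
    return f"Answer to: {prompt}"


async def shadow_llm(prompt: str) -> bool:
    """Hypothetical stand-in for the shadow LLM in detection state.

    Returns True when the prompt looks harmful. A toy keyword check
    replaces the real detection prompt or tuned defense model.
    """
    await asyncio.sleep(0.01)  # simulated detection latency
    return "build a bomb" in prompt.lower()


async def self_defend(prompt: str) -> str:
    # Start the normal answer immediately so the check adds little delay.
    answer_task = asyncio.create_task(target_llm(prompt))
    # Run the detection instance concurrently on the same prompt.
    flagged = await shadow_llm(prompt)
    if flagged:
        # Drop the pending answer and refuse instead.
        answer_task.cancel()
        await asyncio.gather(answer_task, return_exceptions=True)
        return REFUSAL
    return await answer_task


def respond(prompt: str) -> str:
    """Synchronous convenience wrapper around self_defend."""
    return asyncio.run(self_defend(prompt))
```

Under these assumptions, `respond("What is a shadow stack?")` returns the stub answer, while a prompt containing the flagged phrase returns `REFUSAL`. Because both instances start together, the extra delay for a benign prompt is bounded by however much longer detection takes than answering, which is the property the abstract refers to as negligible delay.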

Code & Models

Models

DavidTKeane/cyberranger-v42 (model · 51 downloads · 1 like)

Taxonomy

Topics: Law, AI, and Intellectual Property · Digital and Cyber Forensics

Methods: Attention Is All You Need · Cosine Annealing · Softmax · Layer Normalization · Weight Decay · Linear Warmup With Cosine Annealing · Linear Layer