LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
Daoyuan Wu, Shuai Wang, Yang Liu, Ning Liu

TL;DR
This paper introduces SELFDEFEND, a lightweight defense mechanism for large language models that detects and blocks harmful prompts used in jailbreak attacks, effectively enhancing safety with minimal delay.
Contribution
The paper proposes a practical, real-time defense method against various jailbreak attacks by detecting harmful prompts through a shadow stack in LLMs.
Findings
SELFDEFEND effectively detects harmful prompts in jailbreak scenarios
The method introduces minimal delay for normal prompts
Manual analysis shows robustness against multiple jailbreak techniques
Abstract
Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs). A considerable amount of research exists proposing more effective jailbreak attacks, including the recent Greedy Coordinate Gradient (GCG) attack, jailbreak template-based attacks such as using "Do-Anything-Now" (DAN), and multilingual jailbreak. In contrast, the defensive side has been relatively less explored. This paper proposes a lightweight yet practical defense called SELFDEFEND, which can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts. Our key insight is that regardless of the kind of jailbreak strategies employed, they eventually need to include a harmful prompt (e.g., "how to make a bomb") in the prompt sent to LLMs, and we found that existing LLMs can…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsLegal Education and Practice Innovations · Law, Economics, and Judicial Systems · Law, AI, and Intellectual Property
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · 15 Ways to Contact How can i speak to someone at Delta Airlines · Linear Layer · Attention Dropout · Dense Connections · Adam · {Dispute@FaQ-s}How to file a dispute with Expedia? · Attention Is All You Need · Softmax · Layer Normalization
