LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A   Vision Paper

Daoyuan Wu; Shuai Wang; Yang Liu; Ning Liu

arXiv:2402.15727·cs.CR·March 5, 2024·1 cites

LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper

Daoyuan Wu, Shuai Wang, Yang Liu, Ning Liu

PDF

Open Access

TL;DR

This paper introduces SELFDEFEND, a lightweight defense mechanism for large language models that detects and blocks harmful prompts used in jailbreak attacks, effectively enhancing safety with minimal delay.

Contribution

The paper proposes a practical, real-time defense method against various jailbreak attacks by detecting harmful prompts through a shadow stack in LLMs.

Findings

01

SELFDEFEND effectively detects harmful prompts in jailbreak scenarios

02

The method introduces minimal delay for normal prompts

03

Manual analysis shows robustness against multiple jailbreak techniques

Abstract

Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs). A considerable amount of research exists proposing more effective jailbreak attacks, including the recent Greedy Coordinate Gradient (GCG) attack, jailbreak template-based attacks such as using "Do-Anything-Now" (DAN), and multilingual jailbreak. In contrast, the defensive side has been relatively less explored. This paper proposes a lightweight yet practical defense called SELFDEFEND, which can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts. Our key insight is that regardless of the kind of jailbreak strategies employed, they eventually need to include a harmful prompt (e.g., "how to make a bomb") in the prompt sent to LLMs, and we found that existing LLMs can…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsLegal Education and Practice Innovations · Law, Economics, and Judicial Systems · Law, AI, and Intellectual Property

MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · 15 Ways to Contact How can i speak to someone at Delta Airlines · Linear Layer · Attention Dropout · Dense Connections · Adam · {Dispute@FaQ-s}How to file a dispute with Expedia? · Attention Is All You Need · Softmax · Layer Normalization