RePD: Defending Jailbreak Attack through a Retrieval-based Prompt   Decomposition Process

Peiran Wang; Xiaogeng Liu; Chaowei Xiao

arXiv:2410.08660·cs.CR·December 2, 2024

RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process

Peiran Wang, Xiaogeng Liu, Chaowei Xiao

PDF

Open Access

TL;DR

RePD is a retrieval-based prompt decomposition framework that enhances large language models' resistance to jailbreak attacks by identifying and neutralizing harmful prompts before generating responses.

Contribution

It introduces a novel retrieval-based prompt decomposition method that effectively defends LLMs against jailbreak attacks without affecting normal performance.

Findings

01

RePD significantly reduces success rate of jailbreak prompts.

02

RePD maintains high response quality on benign prompts.

03

Framework is compatible with various open-source LLMs.

Abstract

In this study, we introduce RePD, an innovative attack Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pretraining and finetuning focused on ethical alignment, LLMs are still susceptible to jailbreak exploits. RePD operates on a one-shot learning model, wherein it accesses a database of pre-collected jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts. This process involves integrating the decomposition of the jailbreak prompt into the user's original query into a one-shot learning example to effectively teach the LLM to discern and separate malicious components. Consequently, the LLM is equipped to first neutralize any potentially harmful elements before addressing the user's prompt in a manner that aligns with its ethical guidelines. RePD…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsDigital and Cyber Forensics · Advanced Malware Detection Techniques · Information and Cyber Security