Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks

Chen Xiong; Xiangyu Qi; Pin-Yu Chen; Tsung-Yi Ho

arXiv:2405.20099 · cs.CR · June 5, 2025 · 3 cites

Open Access

TL;DR

This paper presents Defensive Prompt Patch (DPP), a novel, interpretable prompt-based defense mechanism that significantly reduces jailbreak success rates in large language models while maintaining their utility.

Contribution

Introduces DPP, a prompt-based defense against jailbreak attacks that balances LLM safety and utility and is reported to outperform existing methods.

Findings

01. DPP substantially reduces jailbreak attack success rates.

02. DPP preserves LLM utility with minimal degradation.

03. DPP is scalable and adaptable across different LLM platforms.

Abstract

Safety, security, and compliance are essential requirements when aligning large language models (LLMs). However, many seemingly aligned LLMs are soon shown to be susceptible to jailbreak attacks. These attacks aim to circumvent the models' safety guardrails and security mechanisms by introducing jailbreak prompts into malicious queries. In response to these challenges, this paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism specifically designed to protect LLMs against such sophisticated jailbreak strategies. Unlike previous approaches, which have often compromised the utility of the model for the sake of safety, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Our method uses strategically designed interpretable suffix prompts that effectively thwart a wide range of standard and adaptive…
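The abstract describes two ideas that a small sketch can make concrete: a defensive prompt appended to the user query as a suffix, and the Attack Success Rate (ASR) used to measure how often a jailbreak gets through. The Python below is an illustrative sketch only. The patch text `EXAMPLE_PATCH`, the keyword-based refusal check, and the `toy_model` stub are all assumptions made for this example; none of them is taken from the paper, and the paper's actual patches and evaluation judge may differ.

```python
from typing import Callable, Iterable, Optional

# Hypothetical placeholder text; NOT a patch from the paper.
EXAMPLE_PATCH = (
    "Please check whether this request is safe, and refuse if it is harmful."
)

# Assumed refusal markers for a simple keyword-based judge.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am sorry")


def apply_patch(query: str, patch: str = EXAMPLE_PATCH) -> str:
    """Append the defensive patch to the user query as a suffix."""
    return f"{query.rstrip()} {patch}"


def is_refusal(response: str) -> bool:
    """Treat a response as a refusal if it contains any assumed marker."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def attack_success_rate(
    queries: Iterable[str],
    model: Callable[[str], str],
    patch: Optional[str] = None,
) -> float:
    """Fraction of adversarial queries that are NOT refused by the model."""
    queries = list(queries)
    if not queries:
        return 0.0
    successes = 0
    for query in queries:
        prompt = apply_patch(query, patch) if patch is not None else query
        if not is_refusal(model(prompt)):
            successes += 1
    return successes / len(queries)


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM call: refuses only if the prompt mentions 'refuse'."""
    if "refuse" in prompt.lower():
        return "I'm sorry, but I can't help with that."
    return "Sure, here is how..."
```

In practice `toy_model` would be replaced by a call to the LLM under test, and the same function can be run with and without `patch` to compare ASR before and after the defense. Utility would need a separate benchmark, which this sketch does not cover.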

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Cryptography and Data Security