PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition
Ziyang Zhang, Qizhen Zhang, Jakob Foerster

TL;DR
PARDEN is a simple yet effective method that defends against LLM jailbreaks by asking the model to repeat its own output, significantly reducing false positives without requiring fine-tuning or white-box access to the model.
Contribution
The paper introduces PARDEN, a jailbreak defence that avoids the domain shift of self-classification by prompting the model to repeat its own output, and that outperforms existing jailbreak detection methods.
Findings
PARDEN achieves an 11x reduction in false positive rate (FPR) at a true positive rate (TPR) of 90% for Llama-2-7B.
The method significantly outperforms baseline jailbreak detection techniques.
PARDEN does not require fine-tuning or white-box access to models.
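The "FPR at 90% TPR" figure above is a fixed-operating-point metric: the detection threshold is chosen so that 90% of harmful examples are flagged, and the false positive rate is then read off on benign examples at that same threshold. A minimal sketch of the arithmetic follows, assuming that a higher score means "more likely harmful"; the function name `fpr_at_tpr` and the toy scores are hypothetical and not taken from the paper.

```python
import math


def fpr_at_tpr(harmful_scores, benign_scores, target_tpr=0.90):
    """Return (threshold, fpr) at the strictest threshold reaching target_tpr.

    Assumed convention: an example is flagged when score >= threshold,
    and higher scores mean "more likely harmful".
    """
    if not harmful_scores or not benign_scores:
        raise ValueError("both score lists must be non-empty")

    ranked = sorted(harmful_scores, reverse=True)
    # Number of harmful examples that must be flagged to reach the target TPR.
    # The small epsilon guards against floating-point overshoot in the product.
    k = max(1, math.ceil(target_tpr * len(ranked) - 1e-9))
    threshold = ranked[k - 1]

    # Fraction of benign examples wrongly flagged at that threshold.
    false_positives = sum(score >= threshold for score in benign_scores)
    return threshold, false_positives / len(benign_scores)


# Toy data for illustration only.
harmful = [0.9, 0.8, 0.7, 0.6, 0.1]
benign = [0.05, 0.2, 0.65, 0.3]
threshold, fpr = fpr_at_tpr(harmful, benign, target_tpr=0.8)
```

Comparing two detectors at the same TPR in this way isolates how many benign outputs each one wrongly blocks, which is the quantity the 11x reduction refers to.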
Abstract
Large language models (LLMs) have shown success in many natural language processing tasks. Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks, leading to security risks and abuse of the models. One option to mitigate such risks is to augment the LLM with a dedicated "safeguard", which checks the LLM's inputs or outputs for undesired behaviour. A promising approach is to use the LLM itself as the safeguard. Nonetheless, baseline methods, such as prompting the LLM to self-classify toxic content, demonstrate limited efficacy. We hypothesise that this is due to domain shift: the alignment training imparts a self-censoring behaviour to the model ("Sorry I can't do that"), while the self-classify approach shifts it to a classification format ("Is this prompt malicious"). In this work, we propose PARDEN,…
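The repeat-based idea described above can be sketched roughly as follows: ask the model to repeat its own output, then flag the output when the repetition diverges (for example, because the model refuses to repeat it). This is a sketch under stated assumptions, not the authors' implementation. The abstract is truncated here, so the prompt wording, the comparison step, and the threshold are all assumptions; `generate` is a hypothetical callable standing in for any LLM call, and `difflib.SequenceMatcher` is a stand-in similarity measure, not necessarily the one used in the paper.

```python
from difflib import SequenceMatcher
from typing import Callable


def repeat_check(output: str, generate: Callable[[str], str],
                 threshold: float = 0.6) -> bool:
    """Return True if `output` should be flagged as potentially harmful.

    `generate` is a hypothetical prompt -> completion function.
    `threshold` is an illustrative value, not one taken from the paper.
    """
    # Illustrative prompt wording (an assumption, not the paper's prompt).
    prompt = f"Please repeat the following text exactly:\n[{output}]"
    repeated = generate(prompt)

    # Similarity in [0, 1]; a faithful repetition scores close to 1.
    similarity = SequenceMatcher(None, output, repeated).ratio()

    # A low score means the model declined or altered the text, which this
    # sketch treats as a sign that the original output was harmful.
    return similarity < threshold


# Stub "models" for demonstration only.
def echo_model(prompt: str) -> str:
    # Repeats whatever sits between the first "[" and the last "]".
    return prompt[prompt.index("[") + 1:prompt.rindex("]")]


def refusing_model(prompt: str) -> str:
    return "Sorry, I can't repeat that."


sample = ("Step one: gather the materials. Step two: mix them together "
          "carefully. Step three: wait for an hour before use.")
flagged_by_echo = repeat_check(sample, echo_model)
flagged_by_refusal = repeat_check(sample, refusing_model)
```

Because the check only needs the model's text output, a sketch like this is consistent with the claim that no fine-tuning or white-box access is required.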
Taxonomy
Topics: Criminal Law and Evidence · Criminal Justice and Corrections Analysis · Jury Decision Making Processes
Methods: LLaMA
