PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

Ziyang Zhang; Qizhen Zhang; Jakob Foerster

arXiv:2405.07932·cs.CL·May 15, 2024

TL;DR

PARDEN is a simple yet effective method that defends against LLM jailbreaks by asking the model to repeat its own output. It significantly reduces false positives and requires neither fine-tuning nor white-box access to the model.

Contribution

The paper introduces PARDEN, a novel approach that avoids domain shift by prompting models to repeat outputs, outperforming existing jailbreak detection methods.

Findings

01 PARDEN achieves a roughly 11x reduction in false positive rate (FPR) at a 90% true positive rate (TPR) for Llama-2-7B.

02 The method significantly outperforms baseline jailbreak detection techniques, such as prompting the LLM to self-classify toxic content.

03 PARDEN does not require fine-tuning or white-box access to models.
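The "FPR at 90% TPR" figure in the first finding can be read as: pick the detection threshold that catches at least 90% of harmful examples, then measure how many benign examples are wrongly flagged at that threshold. The sketch below shows one way to compute such a number. The function name `fpr_at_tpr`, the convention that a higher score means "more suspicious", and the toy scores are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
import math
from typing import Sequence


def fpr_at_tpr(
    harmful_scores: Sequence[float],
    benign_scores: Sequence[float],
    target_tpr: float = 0.9,
) -> float:
    """Return the false positive rate at the threshold that reaches target_tpr.

    Assumed convention (hypothetical): an example is flagged when its
    score is >= threshold, and a higher score means "more suspicious".
    """
    n = len(harmful_scores)
    # Number of harmful examples that must be flagged to reach the target TPR.
    k = max(1, math.ceil(target_tpr * n - 1e-9))
    # The k-th highest harmful score is the loosest threshold that flags k of them.
    threshold = sorted(harmful_scores, reverse=True)[k - 1]
    false_positives = sum(1 for s in benign_scores if s >= threshold)
    return false_positives / len(benign_scores)


# Toy scores, invented purely for illustration.
harmful = [0.9, 0.8, 0.7, 0.6, 0.5, 0.95, 0.85, 0.75, 0.65, 0.1]
benign = [0.05, 0.2, 0.3, 0.55, 0.0]
print(fpr_at_tpr(harmful, benign, target_tpr=0.9))
```

An "11x reduction" then means that the baseline's FPR divided by PARDEN's FPR, both measured at the same 90% TPR, is about 11. If a detector produces scores where lower means more suspicious, negating the scores before calling the function gives the same result.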

Abstract

Large language models (LLMs) have shown success in many natural language processing tasks. Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks, leading to security risks and abuse of the models. One option to mitigate such risks is to augment the LLM with a dedicated "safeguard", which checks the LLM's inputs or outputs for undesired behaviour. A promising approach is to use the LLM itself as the safeguard. Nonetheless, baseline methods, such as prompting the LLM to self-classify toxic content, demonstrate limited efficacy. We hypothesise that this is due to domain shift: the alignment training imparts a self-censoring behaviour to the model ("Sorry I can't do that"), while the self-classify approach shifts it to a classification format ("Is this prompt malicious"). In this work, we propose PARDEN,…
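The repetition idea summarised above can be sketched in a few lines: ask the model to repeat a candidate output, then flag the output if what comes back does not closely match it (for example, because the model refused). This is a minimal sketch under stated assumptions, not the authors' implementation. The prompt wording, the word-level `difflib` similarity, the 0.5 threshold, and the `toy_model` stand-in are all hypothetical choices made for illustration; see the official repository for the actual prompt and scoring.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical prompt template; the wording used in the paper may differ.
REPEAT_PROMPT = (
    "Here is some text in brackets: [{output}] "
    "Please repeat the text in the brackets exactly."
)


def repetition_score(original: str, repeated: str) -> float:
    """Word-level similarity in [0, 1]; a stand-in for the paper's metric."""
    return SequenceMatcher(None, original.split(), repeated.split()).ratio()


def flag_by_repetition(
    output: str,
    generate: Callable[[str], str],
    threshold: float = 0.5,  # assumed value, to be tuned on held-out data
) -> bool:
    """Return True if the model fails to repeat `output` closely enough."""
    repeated = generate(REPEAT_PROMPT.format(output=output))
    return repetition_score(output, repeated) < threshold


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an aligned LLM: refuses on one keyword."""
    text = prompt[prompt.index("[") + 1 : prompt.rindex("]")]
    if "explosive" in text:
        return "Sorry, I can't do that."
    return text


print(flag_by_repetition("The capital of France is Paris.", toy_model))
print(flag_by_repetition("Step one: acquire the explosive material.", toy_model))
```

The point of this design is that the safeguard keeps the model in the same refuse-or-comply format that alignment training shaped, instead of moving it to a classification prompt. In practice `generate` would wrap a real LLM call, and the threshold would be chosen from a trade-off between true and false positive rates.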

Code & Models

Repositories

ed-zh/parden (PyTorch, official)

Taxonomy

Topics: Criminal Law and Evidence · Criminal Justice and Corrections Analysis · Jury Decision Making Processes

Methods: LLaMA