Testing the Limits of Jailbreaking Defenses with the Purple Problem

Taeyoun Kim; Suhas Kotha; Aditi Raghunathan

arXiv:2403.14725·cs.CR·June 25, 2024·1 cites

Testing the Limits of Jailbreaking Defenses with the Purple Problem

Taeyoun Kim, Suhas Kotha, Aditi Raghunathan

PDF

Open Access 1 Repo

TL;DR

This paper critically evaluates the robustness of current jailbreaking defenses for language models by testing their ability to prevent outputs containing the word 'purple', revealing significant shortcomings in existing methods.

Contribution

It introduces a simple, well-defined test case for unsafe outputs and demonstrates that current defenses fail, highlighting the need for more effective enforcement mechanisms.

Findings

01

Existing defenses fail on simple 'purple' output test

02

Current safety benchmarks may not effectively evaluate enforcement robustness

03

Highlights the gap between safety definitions and enforcement effectiveness

Abstract

The rise of "jailbreak" attacks on language models has led to a flurry of defenses aimed at preventing undesirable responses. We critically examine the two stages of the defense pipeline: (i) defining what constitutes unsafe outputs, and (ii) enforcing the definition via methods such as input processing or fine-tuning. To test the efficacy of existing enforcement mechanisms, we consider a simple and well-specified definition of unsafe outputs--outputs that contain the word "purple". Surprisingly, existing fine-tuning and input defenses fail on this simple problem, casting doubt on whether enforcement algorithms can be robust for more complicated definitions. We find that real safety benchmarks similarly test enforcement for a fixed definition. We hope that future research can lead to effective/fast enforcement as well as high quality definitions used for enforcement and evaluation.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

kothasuhas/purple-problem
pytorchOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsCybercrime and Law Enforcement Studies · Digital and Cyber Forensics · Law, AI, and Intellectual Property