Testing the Limits of Jailbreaking Defenses with the Purple Problem
Taeyoun Kim, Suhas Kotha, Aditi Raghunathan

TL;DR
This paper critically evaluates the robustness of current jailbreaking defenses for language models by testing their ability to prevent outputs containing the word 'purple', revealing significant shortcomings in existing methods.
Contribution
It introduces a simple, well-defined test case for unsafe outputs and demonstrates that current defenses fail, highlighting the need for more effective enforcement mechanisms.
Findings
Existing defenses fail on simple 'purple' output test
Current safety benchmarks may not effectively evaluate enforcement robustness
Highlights the gap between safety definitions and enforcement effectiveness
Abstract
The rise of "jailbreak" attacks on language models has led to a flurry of defenses aimed at preventing undesirable responses. We critically examine the two stages of the defense pipeline: (i) defining what constitutes unsafe outputs, and (ii) enforcing the definition via methods such as input processing or fine-tuning. To test the efficacy of existing enforcement mechanisms, we consider a simple and well-specified definition of unsafe outputs--outputs that contain the word "purple". Surprisingly, existing fine-tuning and input defenses fail on this simple problem, casting doubt on whether enforcement algorithms can be robust for more complicated definitions. We find that real safety benchmarks similarly test enforcement for a fixed definition. We hope that future research can lead to effective/fast enforcement as well as high quality definitions used for enforcement and evaluation.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsCybercrime and Law Enforcement Studies · Digital and Cyber Forensics · Law, AI, and Intellectual Property
