Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree, Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez

TL;DR
This paper examines the challenges of defending large language models against specific jailbreaks, demonstrating limitations of existing methods and proposing a transcript-classifier approach that improves but does not fully solve the problem.
Contribution
The paper introduces a transcript-classifier defense method tailored for narrow-domain jailbreak prevention, showing its advantages over traditional defenses.
Findings
Existing defenses like safety training and adversarial training are insufficient.
The transcript-classifier outperforms baseline defenses in many cases.
Complete prevention of jailbreaks remains challenging even in narrow domains.
Abstract
Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid a narrowly-defined set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem. In pursuit of a better solution, we develop a transcript-classifier defense which outperforms the baseline defenses we test. However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak-defense even in a narrow domain.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsForensic and Genetic Research
MethodsSparse Evolutionary Training · Focus
