Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

Tony T. Wang; John Hughes; Henry Sleight; Rylan Schaeffer; Rajashree Agrawal; Fazl Barez; Mrinank Sharma; Jesse Mu; Nir Shavit; Ethan Perez

arXiv:2412.02159·cs.LG·December 4, 2024

Open Access

TL;DR

This paper examines how hard it is to defend large language models against jailbreaks when only a narrowly defined set of behaviors is forbidden, using bomb-making assistance as the case study. It shows that existing defenses do not fully solve the problem and proposes a transcript-classifier defense that outperforms them but still fails in some circumstances.

Contribution

The paper introduces a transcript-classifier defense for narrow-domain jailbreak prevention and shows that it outperforms the baseline defenses tested (safety training, adversarial training, and input/output classifiers).

Findings

01 Popular defenses such as safety training, adversarial training, and input/output classifiers do not fully solve the problem.

02 The transcript-classifier defense outperforms the baseline defenses tested.

03 The transcript classifier still fails in some circumstances, so complete jailbreak prevention remains difficult even in a narrow domain.

Abstract

Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid a narrowly-defined set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem. In pursuit of a better solution, we develop a transcript-classifier defense which outperforms the baseline defenses we test. However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak-defense even in a narrow domain.
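To make the distinction between input/output classifiers and a transcript classifier concrete, here is a minimal sketch. It assumes that "transcript classifier" means a classifier applied to the whole conversation (all user and assistant turns, including the candidate reply) rather than to a single message; the abstract does not spell out the mechanism, so this is an illustration and not the authors' implementation. All names (`Turn`, `guarded_reply`, `toy_generate`, `toy_flag`) and the placeholder strings `part-a` and `part-b` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


REFUSAL = "Sorry, I can't help with that."


def render(transcript: List[Turn]) -> str:
    """Flatten a conversation into one string for the classifier."""
    return "\n".join(f"{t.role}: {t.text}" for t in transcript)


def guarded_reply(
    history: List[Turn],
    user_msg: str,
    generate: Callable[[List[Turn]], str],
    flag_transcript: Callable[[str], bool],
) -> str:
    """Generate a candidate reply, then classify the full transcript.

    Assumption: the classifier sees every turn plus the candidate reply,
    so it can flag content that no single message reveals on its own.
    """
    pending = history + [Turn("user", user_msg)]
    candidate = generate(pending)
    full = pending + [Turn("assistant", candidate)]
    return REFUSAL if flag_transcript(render(full)) else candidate


# Hypothetical stand-ins for a real model and a real classifier.
def toy_generate(transcript: List[Turn]) -> str:
    return "here is part-b" if "part-a" in transcript[-1].text else "hello"


def toy_flag(text: str) -> bool:
    # Flags only when both placeholder pieces co-occur in the transcript.
    return "part-a" in text and "part-b" in text


print(guarded_reply([], "hi", toy_generate, toy_flag))
print(guarded_reply([], "tell me about part-a", toy_generate, toy_flag))
```

In the second call neither the user message nor the candidate reply contains both placeholder pieces on its own, but the combined transcript does, so the reply is replaced with the refusal. This is one plausible reason to classify the transcript instead of individual messages. As the abstract notes, the real defense still fails in some circumstances.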

Taxonomy

Topics: Forensic and Genetic Research

Methods: Sparse Evolutionary Training · Focus