Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Stuart Armstrong; Matija Franklin; Connor Stevens; Rebecca Gorman

arXiv:2502.00580 · cs.CR · February 4, 2025

TL;DR

The paper introduces DATDP, a prompt evaluation method that effectively detects and blocks jailbreaking attempts in large language models, significantly enhancing AI safety.

Contribution

It presents a novel prompt evaluation approach that explicitly detects jailbreaking prompts, improving safety across various LLMs with minimal added cost.

Findings

1. DATDP blocked 100% of the BoN paper's successful jailbreaks and 99.8% of the successful jailbreaks in the authors' replication.

2. The method remains effective with smaller evaluation models such as LLaMa-3-8B-instruct.

3. Adding DATDP as a prompt filter significantly increases the safety of generative AI systems.

Abstract

Recent work showed that Best-of-N (BoN) jailbreaking, which repeatedly applies random augmentations (such as capitalization, punctuation, etc.), is effective against all major large language models (LLMs). We have found that 100% of the BoN paper's successful jailbreaks (confidence interval [99.65%, 100.00%]) and 99.8% of successful jailbreaks in our replication (confidence interval [99.28%, 99.98%]) were blocked with our Defense Against The Dark Prompts (DATDP) method. The DATDP algorithm works by repeatedly using an evaluation LLM to evaluate a prompt for dangerous or manipulative behaviors until a robust safety rating is generated; unlike some other approaches, DATDP also explicitly looks for jailbreaking attempts. This success persisted even when smaller LLMs powered the evaluation (Claude and LLaMa-3-8B-instruct proved almost equally capable). These results show…
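
The repeated-evaluation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`should_block`, `scripted_evaluator`), the simple majority vote, the tie-breaking rule, and the default of five evaluations are all assumptions made for illustration. In a real system, `evaluate_once` would wrap a call to an evaluation LLM that is asked whether the prompt is dangerous, manipulative, or a jailbreaking attempt; here it is replaced by a hypothetical stub that replays fixed verdicts.

```python
from collections.abc import Callable, Iterable


def should_block(prompt: str, evaluate_once: Callable[[str], bool], n_evals: int = 5) -> bool:
    """Query the evaluator n_evals times and block on a majority of 'unsafe' verdicts.

    Assumption of this sketch: a plain majority vote, with ties resolved
    toward blocking. The paper's actual aggregation rule may differ.
    """
    if n_evals < 1:
        raise ValueError("n_evals must be at least 1")
    # Each call returns True when the evaluator judges the prompt unsafe.
    unsafe_votes = sum(1 for _ in range(n_evals) if evaluate_once(prompt))
    return 2 * unsafe_votes >= n_evals


def scripted_evaluator(verdicts: Iterable[bool]) -> Callable[[str], bool]:
    """Hypothetical stand-in for an LLM call: replays fixed verdicts in order."""
    remaining = iter(verdicts)

    def evaluate_once(prompt: str) -> bool:
        return next(remaining)

    return evaluate_once


# Two of three evaluations flag the augmented prompt, so it is blocked.
noisy_unsafe = scripted_evaluator([True, False, True])
print(should_block("eXaMpLe aUgMeNtEd pRoMpT", noisy_unsafe, n_evals=3))
```

Repeating the evaluation is what makes the rating robust in this sketch: a single noisy verdict cannot flip the outcome, which matters when the attack itself relies on sampling many random augmentations until one slips through.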

Code & Models

Repositories

alignedai/DATDP (PyTorch, official)

Taxonomy

Topics: Digital and Cyber Forensics · Cryptographic Implementations and Security · Law, Rights, and Freedoms