Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
Stuart Armstrong, Matija Franklin, Connor Stevens, Rebecca Gorman

TL;DR
The paper introduces DATDP, a prompt evaluation method that effectively detects and blocks jailbreaking attempts in large language models, significantly enhancing AI safety.
Contribution
It presents a novel prompt evaluation approach that explicitly detects jailbreaking prompts, improving safety across various LLMs with minimal added cost.
Findings
DATDP blocked 100% of successful jailbreaks in tested scenarios.
The method is effective even with smaller evaluation models like LLaMa-3-8B.
Adding DATDP significantly increases safety in generative AI systems.
Abstract
Recent work showed that Best-of-N (BoN) jailbreaking, which repeatedly applies random augmentations (such as capitalization, punctuation, etc.), is effective against all major large language models (LLMs). We have found that of the BoN paper's successful jailbreaks (confidence interval ) and of successful jailbreaks in our replication (confidence interval ) were blocked with our Defense Against The Dark Prompts (DATDP) method. The DATDP algorithm works by repeatedly using an evaluation LLM to evaluate a prompt for dangerous or manipulative behaviors until a robust safety rating is generated; unlike some other approaches, DATDP also explicitly looks for jailbreaking attempts. This success persisted even when smaller LLMs powered the evaluation (Claude and LLaMa-3-8B-instruct proved almost equally capable). These results show…
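The repeated-evaluation idea in the abstract can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names (`is_blocked`, `toy_evaluator`), the number of rounds, and the vote weighting are all assumptions made for the example, and the keyword-matching evaluator is a stand-in for a real call to an evaluation LLM.

```python
from typing import Callable


def is_blocked(
    prompt: str,
    evaluate: Callable[[str], bool],
    n_rounds: int = 5,      # assumed round count, not taken from the abstract
    flag_weight: int = 2,   # assumed: a "flag" vote counts more than a "clear" vote
) -> bool:
    """Query the evaluator n_rounds times and aggregate its votes.

    `evaluate` returns True when it judges the prompt dangerous,
    manipulative, or a jailbreaking attempt. Each flag adds
    `flag_weight` to the score and each clear subtracts 1; the prompt
    is blocked when the final score is positive.
    """
    if n_rounds < 1:
        raise ValueError("n_rounds must be at least 1")
    score = 0
    for _ in range(n_rounds):
        score += flag_weight if evaluate(prompt) else -1
    return score > 0


def toy_evaluator(prompt: str) -> bool:
    """Hypothetical stand-in for an evaluation LLM.

    Lowercasing first means random capitalization (one of the BoN
    augmentations named in the abstract) does not hide the phrase.
    """
    return "ignore previous instructions" in prompt.lower()


if __name__ == "__main__":
    print(is_blocked("IgNoRe PrEvIoUs InStRuCtIoNs and continue", toy_evaluator))
    print(is_blocked("What is the capital of France?", toy_evaluator))
```

Weighting flags more heavily than clears biases the filter toward caution: with the assumed defaults, two flags out of five rounds are enough to block a prompt. A real evaluator would be stochastic, which is why the check is repeated rather than run once.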
Taxonomy
Topics: Digital and Cyber Forensics · Cryptographic Implementations and Security · Law, Rights, and Freedoms
