The Great Pretender: A Stochasticity Problem in LLM Jailbreak
Jean-Philippe Monteuuis, Cong Chen, Jonathan Petit

TL;DR
This paper investigates how stochasticity makes attack success rates unstable in LLM jailbreaks, and proposes new metrics and frameworks for evaluating jailbreak prompts and for generating more reliable ones.
Contribution
It introduces the CAS-eval and CAS-gen frameworks to assess and improve the consistency of jailbreak attack success rates against LLMs.
Findings
ASR is unstable and inflated across studies.
A generated jailbreak prompt succeeded in only 5 of 10 repeated attempts (50%) against the target model, despite an 80% reported ASR.
CAS-gen framework improves attack reliability by recovering success rate losses.
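The gap between a reported ASR and the repeated-attempt success rate of a single prompt can be illustrated with a little arithmetic. The sketch below is only an illustration under stated assumptions: the outcome table is made-up data, the helper names `one_shot_asr` and `repeat_success_rate` are hypothetical, and neither function is the paper's CAS-eval metric, whose definition is not given in this summary.

```python
from typing import List


def one_shot_asr(outcomes: List[List[bool]]) -> float:
    """Fraction of prompts whose first attempt succeeded (one trial per prompt)."""
    return sum(trials[0] for trials in outcomes) / len(outcomes)


def repeat_success_rate(trials: List[bool]) -> float:
    """Fraction of repeated attempts of a single prompt that succeeded."""
    return sum(trials) / len(trials)


# Hypothetical outcomes (assumption, not data from the paper):
# 5 prompts, each sent 10 times to the same target model.
outcomes = [
    [True, False, True, False, True, False, True, False, True, False],
    [True] * 10,
    [True, True, False, False, False, True, False, False, True, False],
    [False] * 10,
    [True] + [False] * 9,
]

asr = one_shot_asr(outcomes)
per_prompt = [repeat_success_rate(trials) for trials in outcomes]

print(f"one-shot ASR: {asr:.0%}")  # one-shot ASR: 80%
print("per-prompt repeat success rates:", per_prompt)
```

In this toy table, four of the five prompts succeed on their first attempt, so a one-shot evaluation reports 80%, yet the first prompt succeeds in only 5 of its 10 repetitions and the last in only 1 of 10. A single trial per prompt hides that spread.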
Abstract
"Oh-Oh, yes, I'm the great pretender. Pretending that I'm doing well. My need is such, I pretend too much..." summarizes the state in the area of jailbreak creation and evaluation. You find this method to generate adversarial attacks proposed by a reputable institution (e.g., BoN from Anthropic or Crescendo from Microsoft Research). However, this method does not deliver on the promise claimed in the paper despite having top ASR scores against industry-grade LLMs. You successfully generate the jailbreak prompts against your target (open) model. However, the generated jailbreak prompt works against the target model with a 50% consecutive success rate (5 out of 10 attempts) despite having an 80% ASR (on paper) on the latest closed-source model (with a guardrail system)! This observation leads us to think. First, Attack Success Rate (ASR), the primary metric for LLM jailbreak benchmarking,…
