The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, and Florian Tramèr

TL;DR
This paper demonstrates that adaptive, resource-intensive attack strategies can effectively bypass most current language model defenses against jailbreaks and prompt injections, highlighting the need for more robust evaluation methods.
Contribution
The authors introduce a systematic approach to evaluate language model defenses against adaptive attackers, revealing vulnerabilities in recent defenses that claimed high robustness.
Findings
Most defenses were bypassed with an attack success rate above 90%
Adaptive attacks outperform previous static or weak optimization methods
Many defenses that claimed near-zero attack success rates are vulnerable to stronger attacks
Abstract
How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques (gradient descent, reinforcement learning, random search, and human-guided exploration), we bypass 12 recent defenses (based on a diverse set of…
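To make the idea of a query-based optimization loop concrete, the sketch below shows a generic random search over strings against a black-box score. It is an illustration only, not the authors' attack. The names `toy_score`, `mutate`, and `random_search` are hypothetical, and the toy objective (matching a hidden string) merely stands in for whatever signal an evaluator would obtain by querying the system under test.

```python
import random
import string

ALPHABET = string.ascii_lowercase


def toy_score(candidate: str, hidden: str = "adaptive") -> int:
    """Hypothetical black-box score: count of positions matching a hidden string."""
    return sum(a == b for a, b in zip(candidate, hidden))


def mutate(candidate: str, rng: random.Random) -> str:
    """Propose a neighbor by replacing one randomly chosen character."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]


def random_search(score, length: int, budget: int, pool_size: int = 8, seed: int = 0):
    """Keep the best candidate seen so far; each step tries a small pool of mutations."""
    rng = random.Random(seed)
    best = "".join(rng.choice(ALPHABET) for _ in range(length))
    best_score = score(best)
    history = [best_score]
    for _ in range(budget):
        proposals = [mutate(best, rng) for _ in range(pool_size)]  # propose candidates
        scored = [(score(p), p) for p in proposals]                # score each one
        top_score, top = max(scored)                               # select the best
        if top_score >= best_score:                                # update the incumbent
            best, best_score = top, top_score
        history.append(best_score)
    return best, best_score, history


if __name__ == "__main__":
    best, best_score, history = random_search(toy_score, length=8, budget=200)
    print(best, best_score)
```

The point of the sketch is the shape of the loop: the search only needs a score per query, so a larger `budget` or `pool_size` gives it more chances to improve, which is the sense in which a static test set can understate what a well-resourced search finds.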
Peer Reviews
Decision: Submitted to ICLR 2026
1. This paper is well-written. 2. The soundness of this method is good. 3. The experiments are quite solid. 4. The findings proposed in this paper are interesting to the community.
From my view, this paper has no obvious weakness, and I think the solid evaluation made by the authors and the insights proposed in this paper should be highlighted to the adversarial community.
1. Strong empirical contribution. The authors conduct a comprehensive empirical study, evaluating 12 well-known LLM defenses under multiple adaptive attack paradigms. The scale and depth of this evaluation are impressive and provide much-needed realism in jailbreak defense assessment. 2. Methodological framework. The generalized “PSSU” adaptive attack loop unifies existing attack types under a single conceptual lens. This provides a reusable structure for future evaluation tools. 3. High pract
1. Cross-method comparability and interpretability. Because the 12 defenses come from heterogeneous benchmarks and threat settings, the reported attack success rates are not directly comparable. The paper explicitly states this limitation, but it weakens the quantitative conclusions — making it hard to decide which defense class performs better. As a result, the framework cannot currently guide defense selection. 2. Limited extensibility and sustainability. The framework lacks an explicit plan
1. The paper addresses a highly relevant and pressing issue in LLM safety, evaluating the true robustness of jailbreaking and prompt-injection defenses. 2. The authors conduct an extensive and systematic analysis of 12 diverse defense mechanisms, providing a thorough assessment.
1. The proposed attack framework requires more clarification for different threat models. For example, under white-box access, what exactly is the attacker optimizing? Are they generating a suffix, a full prompt template, or token-level perturbations, and which optimization strategies are available or suitable in each case? Conversely, under black-box access, what are the precise inputs and outputs for each attack family (gradient-based, RL, search, human red-teaming), and how do those strategie
Taxonomy
Topics: Security and Verification in Computing · Adversarial Robustness in Machine Learning · Network Security and Intrusion Detection
