The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

Milad Nasr; Nicholas Carlini; Chawin Sitawarin; Sander V. Schulhoff; Jamie Hayes; Michael Ilie; Juliette Pluto; Shuang Song; Harsh Chaudhari; Ilia Shumailov; Abhradeep Thakurta; Kai Yuanqing Xiao; Andreas Terzis; and Florian Tramèr

arXiv:2510.09023·cs.LG·October 13, 2025·2 cites

Open Access · 3 Reviews

TL;DR

This paper demonstrates that adaptive, resource-intensive attack strategies can effectively bypass most current language model defenses against jailbreaks and prompt injections, highlighting the need for more robust evaluation methods.

Contribution

The authors introduce a systematic approach to evaluate language model defenses against adaptive attackers, revealing vulnerabilities in recent defenses that claimed high robustness.

Findings

01 · Most defenses were bypassed with attack success rates above 90%

02 · Adaptive attacks outperform previous static or weak optimization methods

03 · Many defenses that originally reported near-zero attack success rates are vulnerable to stronger attacks

Abstract

How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques (gradient descent, reinforcement learning, random search, and human-guided exploration), we bypass 12 recent defenses (based on a diverse set of…
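The gap between static and adaptive evaluation described in the abstract can be illustrated with a deliberately abstract toy. This is a minimal sketch and not the authors' method: `SECRET`, `score`, `static_eval`, and `adaptive_random_search` are hypothetical names invented here, and the "defense" is reduced to a black-box score that reaches 1.0 only on one hidden input. A fixed list of test strings reports zero successes, while a simple random search that keeps the best candidate so far finds the failing input within a modest query budget.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# Hypothetical stand-in for "an input the defended system fails on".
SECRET = "opensesame"


def score(candidate: str) -> float:
    """Toy black-box objective: fraction of positions matching SECRET."""
    return sum(a == b for a, b in zip(candidate, SECRET)) / len(SECRET)


def static_eval(fixed_attacks: list[str]) -> float:
    """Static evaluation: success rate over a fixed list of strings."""
    return sum(score(a) == 1.0 for a in fixed_attacks) / len(fixed_attacks)


def adaptive_random_search(budget: int, seed: int = 0) -> tuple[str, float, int]:
    """Adaptive evaluation: mutate one character at a time, keep improvements."""
    rng = random.Random(seed)
    best = "".join(rng.choice(ALPHABET) for _ in SECRET)
    best_score = score(best)
    queries = 1
    while best_score < 1.0 and queries < budget:
        # Propose a single-character mutation of the current best candidate.
        pos = rng.randrange(len(best))
        cand = best[:pos] + rng.choice(ALPHABET) + best[pos + 1:]
        # Score it with one black-box query.
        cand_score = score(cand)
        queries += 1
        # Keep the candidate if it is at least as good as the current best.
        if cand_score >= best_score:
            best, best_score = cand, cand_score
    return best, best_score, queries


if __name__ == "__main__":
    print("static success rate:", static_eval(["aaaaaaaaaa", "password12"]))
    best, best_score, queries = adaptive_random_search(budget=20000)
    print("adaptive result:", best, best_score, "after", queries, "queries")
```

The toy objective leaks partial progress through its score, which is what makes the search easy here; real systems expose much noisier signals, so this only shows why a fixed test set can understate what an optimizing attacker finds.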

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. This paper is well-written. 2. The soundness of this method is good. 3. The experiments are quite solid. 4. The findings proposed in this paper are interesting to the community.

Weaknesses

From my view, this paper has no obvious weakness, and I think the solid evaluation made by the authors and the insights proposed in this paper should be highlighted to the adversarial community.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. Strong empirical contribution. The authors conduct a comprehensive empirical study, evaluating 12 well-known LLM defenses under multiple adaptive attack paradigms. The scale and depth of this evaluation are impressive and provide much-needed realism in jailbreak defense assessment. 2. Methodological framework. The generalized “PSSU” adaptive attack loop unifies existing attack types under a single conceptual lens. This provides a reusable structure for future evaluation tools. 3. High pract

Weaknesses

1. Cross-method comparability and interpretability. Because the 12 defenses come from heterogeneous benchmarks and threat settings, the reported attack success rates are not directly comparable. The paper explicitly states this limitation, but it weakens the quantitative conclusions — making it hard to decide which defense class performs better. As a result, the framework cannot currently guide defense selection. 2. Limited extensibility and sustainability. The framework lacks an explicit plan

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. The paper addresses a highly relevant and pressing issue in LLM safety, evaluating the true robustness of jailbreaking and prompt-injection defenses. 2. The authors conduct an extensive and systematic analysis of 12 diverse defense mechanisms, providing a thorough assessment.

Weaknesses

1. The proposed attack framework requires more clarification for different threat models. For example, under white-box access, what exactly is the attacker optimizing? Are they generating a suffix, a full prompt template, or token-level perturbations, and which optimization strategies are available or suitable in each case? Conversely, under black-box access, what are the precise inputs and outputs for each attack family (gradient-based, RL, search, human red-teaming), and how do those strategie

Taxonomy

Topics: Security and Verification in Computing · Adversarial Robustness in Machine Learning · Network Security and Intrusion Detection