Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs

Xiaomeng Hu; Pin-Yu Chen; Tsung-Yi Ho

arXiv:2507.04365·cs.CR·July 8, 2025

TL;DR

This paper uncovers the universal phenomenon of Attention Slipping during jailbreak attacks on LLMs, and proposes an efficient defense method called Attention Sharpening that effectively mitigates these attacks without extra overhead.

Contribution

The paper identifies Attention Slipping as a common mechanism in jailbreak attacks and introduces Attention Sharpening, a novel, computationally efficient defense method that counters this phenomenon.

Findings

01. Attention Slipping occurs across various jailbreak methods.

02. Token Highlighter and SmoothLLM indirectly mitigate Attention Slipping.

03. Attention Sharpening effectively resists jailbreaks while preserving benign performance.
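
The reviews below describe the mitigation as temperature scaling. As a minimal sketch of what sharpening an attention distribution could look like, the snippet below divides pre-softmax scores by a temperature below 1, which concentrates probability mass on the highest-scoring positions. The scores, the temperature value, and the function name are illustrative assumptions, not the paper's implementation or settings.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax over raw attention scores.

    A temperature below 1 sharpens the distribution (assumed form of
    sharpening; the paper's exact formulation is not given in this summary).
    """
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pre-softmax scores for a single query position.
scores = [2.0, 1.0, 0.5, 0.5]

baseline = softmax(scores)                    # ordinary attention weights
sharpened = softmax(scores, temperature=0.5)  # sharpened attention weights

print("baseline :", [round(w, 3) for w in baseline])
print("sharpened:", [round(w, 3) for w in sharpened])
```

Because the change is a rescaling inside the existing softmax, a defense of this shape would not need extra forward passes, which is consistent with the efficiency claim in the TL;DR.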

Abstract

As large language models (LLMs) become more integral to society and technology, ensuring their safety becomes essential. Jailbreak attacks exploit vulnerabilities to bypass safety guardrails, posing a significant threat. However, the mechanisms enabling these attacks are not well understood. In this paper, we reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping. During this phenomenon, the model gradually reduces the attention it allocates to unsafe requests in a user query during the attack process, ultimately causing a jailbreak. We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning. Additionally, we evaluate two defenses based on query perturbation, Token Highlighter and SmoothLLM, and find they indirectly mitigate Attention…
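
To make the phenomenon concrete, here is a small sketch of how one might track the share of attention that falls on the unsafe request across attack iterations. The reviews refer to an attention rate (AR); the definition used here (attention mass on request tokens divided by total attention mass) is an assumption, and the weights and token positions are made-up illustrative values rather than results from the paper.

```python
def attention_rate(weights, request_positions):
    """Fraction of total attention mass placed on the request tokens.

    Assumed definition for illustration only.
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("attention weights sum to zero")
    return sum(weights[i] for i in request_positions) / total


# Hypothetical: tokens 0-2 hold the unsafe request, tokens 3-4 are
# adversarial additions (for example an optimized suffix).
request_positions = [0, 1, 2]

# Hypothetical attention weights at three successive attack iterations.
trajectory = [
    [0.30, 0.25, 0.20, 0.15, 0.10],
    [0.20, 0.15, 0.15, 0.30, 0.20],
    [0.10, 0.05, 0.05, 0.45, 0.35],
]

rates = [attention_rate(w, request_positions) for w in trajectory]

# "Slipping" here means the rate falls at every step of the attack.
slipping = all(later < earlier for earlier, later in zip(rates, rates[1:]))

print([round(r, 2) for r in rates], "slipping:", slipping)
```

In a real measurement the weights would come from the model's attention maps, and the choice of layers, heads, and query position to aggregate over would matter; those choices are left open here.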

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

1. The paper addresses an important research question by investigating the role of attention mechanisms in jailbreak attacks on LLMs.
2. The paper is exceptionally well-written with clear narrative structure and strong logical flow.
3. Valuable findings and insights regarding the Attention Slipping phenomenon are presented, enhancing understanding of jailbreak mechanisms.

Weaknesses

1. **Limited Conceptual Contribution:** The observation that attention decreases during jailbreak attacks has been previously reported in prior work, which diminishes the technical novelty of this paper. As the authors acknowledge in Section 5, concurrent studies like RobustKV (Jiang et al., 2025) and AttnGCG (Wang et al., 2025b) have already identified that jailbreaks manipulate attention patterns. The primary contribution of this paper is demonstrating that attention slips *gradually* during t

Reviewer 02·Rating 4·Confidence 4

Strengths

- Cross-attack, cross-model evidence of a single mechanism (relationship between AR and ASR) via both optimization trajectories and reverse masking.
- Mechanistic angle that links attacks and defenses at the attention level.
- Simple, efficient defense that avoids multi-pass overhead and large memory costs claimed for alternatives.

Weaknesses

- In Figure 5, the increase in attention rate with stronger defense strength does not appear very pronounced, especially compared to the degree of ASR reduction, so the claim in L361–362 seems somewhat overstated.
- The authors devote substantial space to analyzing and illustrating the Attention Slipping phenomenon; however, as they themselves discuss in Section 5, this phenomenon appears to have been proposed in previous work under different formulations (see L469, “two ways of looking at the sam

Reviewer 03·Rating 2·Confidence 4

Strengths

1. The paper provides a mechanistic interpretation of jailbreak behavior through “attention slipping,” connecting model interpretability with adversarial robustness.
2. Experiments span four major LLMs and multiple jailbreak families, making the observed phenomenon and proposed defense appear robust and generalizable.

Weaknesses

1. My major concern is that, although the paper emphasizes its focus on the dynamics of attention slipping rather than static attention redistribution, the mechanism and mitigation strategy (temperature scaling / KV modification) are conceptually similar to RobustKV (ICLR 2025) and related attention-based defenses. The actual distinction feels incremental rather than fundamental.
2. The claim that “attention slipping causes jailbreak success” is largely correlational. There is no controlled intervent