Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs
Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Zheng Liu, Lirong Qiu, Zaisheng Ye

TL;DR
This paper introduces attention-based metrics and strategies to analyze, attack, and defend large language models by manipulating their attention distributions, revealing vulnerabilities and proposing robustness enhancements.
Contribution
It proposes novel attention metrics together with the Attention-Based Attack (ABA) and Attention-Based Defense (ABD) strategies, combining analysis, attack, and defense methods to improve the understanding and security of LLMs.
Findings
ABA effectively diverts attention to induce harmful outputs.
ABD enhances LLM robustness by calibrating attention distribution.
Attention distribution significantly impacts LLM output quality.
Abstract
Jailbreak attacks can be used to assess the vulnerabilities of Large Language Models (LLMs) by inducing them to generate harmful content. The most common attack method is to construct semantically ambiguous prompts that confuse and mislead the LLMs. To assess security and reveal the intrinsic relation between the input prompt and the output of LLMs, the distribution of attention weights is introduced to analyze the underlying reasons. Using statistical analysis, novel metrics are defined to better describe the distribution of attention weights, such as the Attention Intensity on Sensitive Words (Attn_SensWords), the Attention-based Contextual Dependency Score (Attn_DepScore) and Attention Dispersion Entropy (Attn_Entropy). By leveraging the distinct characteristics of these metrics, the beam search algorithm and inspired by the military strategy "Feint and…
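The abstract names the metrics but this page does not reproduce their definitions. As a rough illustration only, the sketch below assumes that Attn_Entropy behaves like the Shannon entropy of a normalized per-token attention distribution and that Attn_SensWords behaves like the share of attention mass falling on a set of sensitive tokens. Both definitions, the function names, and the example tokens are assumptions made for this sketch, not the paper's formulas. Attn_DepScore is omitted because nothing here indicates how it is computed.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of attention weights after normalization.

    Assumed stand-in for Attn_Entropy: higher values mean attention is
    spread more evenly across tokens.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights]
    # Zero-probability tokens contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


def sensitive_word_attention(tokens, weights, sensitive):
    """Fraction of total attention mass on tokens in `sensitive`.

    Assumed stand-in for Attn_SensWords; `sensitive` is a hypothetical
    set of lowercase words supplied by the caller.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    hit = sum(w for t, w in zip(tokens, weights) if t.lower() in sensitive)
    return hit / total


# Hypothetical prompt tokens and made-up attention weights.
tokens = ["tell", "me", "the", "secret"]
weights = [0.1, 0.2, 0.3, 0.4]
entropy = attention_entropy(weights)
share = sensitive_word_attention(tokens, weights, {"secret"})
```

Under these assumed definitions, a prompt that pulls attention away from the sensitive word would lower `share` and typically raise `entropy`, which is the kind of shift the summary attributes to the attack.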
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The topic is important for LLM security. 2. The attention-based analysis is interesting.
1. The preliminary analysis in Section 2 lacks sufficient breadth to fully support the authors' arguments and the attack and defense method design. 2. The compared jailbreak methods are all proposed in 2023 and may not reflect the latest developments in the field. There are many jailbreaking attacks proposed in 2024, and they work well. 3. Many descriptions lack detail, leading to potential confusion. 4. Lack of comparison with other defense methods.
Novel findings: The finding on attention mechanisms around jailbreak attacks is novel, and it offers a new perspective on how to optimize jailbreak attacks and defenses in the future. High effectiveness: The Attention-Based Attack seems to outperform current baselines in jailbreak attacks in terms of attack success rate. Thorough analysis: The authors introduce three different metrics for analyzing the relationships between attention and jailbreak attacks. The new metrics offer new
ABD weakness 1: This jailbreak defense is built on an attention-based risk score. And as mentioned in the paper, a suitable threshold is the foundation of ABD. In general, a good metric/threshold comprises two components: high True Positive Rate (TPR) and lower False Positive Rate (FPR). Table 3 shows that ABD can have pretty high TPR when testing on datasets containing evil targets only. I am curious to see if the TPR will drop or the FPR will rise when you mix in 50% benign prompts into testi
1. This is the first work in analyzing jailbreak attacks through attention mechanisms, providing a fresh approach to understanding how these attacks work at a model behavior level. 2. Comprehensive attack success rates across multiple models, showing strong empirical results on both open and closed source models including the latest LLMs like GPT-4 and Claude-3.
1. The paper's claimed correlation between attention metrics and Attack Success Rate (ASR) is not well supported by the empirical results. For instance, in Llama2-7B, TAP and DeepInception have nearly identical Attn_SensWords values (0.0089 vs 0.0087) yet their ASR differs dramatically (0.30 vs 0.69), demonstrating that these metrics may not be reliable indicators of attack effectiveness. The authors should address such contradictory examples and provide more rigorous statistical analysis to sup
- The work characterizes jailbreak attacks from the perspective of attention weights, which is overlooked in previous studies. - The proposed attack outperforms a set of existing attacks in terms of ASR and number of queries. - The paper is overall well-written and structured.
- The proposed metrics and methods lack any theoretical justification. Why are these particular metrics chosen? What are their connections/differences? While the proposed beam search algorithm for prompt refinement is interesting, it could benefit from a more detailed analysis of its convergence properties. - The proposed defense (ABD) seems rather heuristic. Why is Attn_SensWords not used in the formula? How to set the hyper-parameter $\sigma$ optimally? Further, can the adversary develop an a
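One review above asks how a threshold-based defense behaves when benign prompts are mixed into the test set. The sketch below shows how the True Positive Rate and False Positive Rate could be measured for such a check. It assumes a prompt is flagged when its risk score is at or above the threshold; that direction, the function name `tpr_fpr`, and the scores are illustrative assumptions, not values or code from the paper.

```python
def tpr_fpr(scores, labels, threshold):
    """Return (TPR, FPR) for a detector that flags score >= threshold.

    labels: True for harmful prompts, False for benign prompts.
    """
    tp = fp = fn = tn = 0
    for score, harmful in zip(scores, labels):
        flagged = score >= threshold
        if harmful and flagged:
            tp += 1
        elif harmful:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    # Guard against a test set that lacks one of the two classes.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Made-up risk scores for a 50/50 mix of harmful and benign prompts.
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
labels = [True, True, True, False, False, False]
tpr, fpr = tpr_fpr(scores, labels, threshold=0.5)
```

Sweeping `threshold` over a range and recording both rates would show the trade-off the reviewer describes: a test set of harmful prompts alone can report a high TPR while revealing nothing about the FPR.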
Taxonomy
Topics: Law, AI, and Intellectual Property
Methods: Softmax · Attention Is All You Need
