Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment

Pedram Zaree; Md Abdullah Al Mamun; Quazi Mishkatul Alam; Yue Dong; Ihsen Alouani; Nael Abu-Ghazaleh

arXiv:2502.15334·cs.CR·February 24, 2025

TL;DR

This paper introduces Attention Eclipse, a method that manipulates model attention to craft more effective and more transferable jailbreak attacks on large language models, showing that safety-aligned models remain vulnerable.

Contribution

It presents a new attention-based attack technique that enhances the success rate and transferability of jailbreaks while reducing their computational cost.

Findings

1. Amplifies the success rates of existing jailbreak attacks, including GCG, AutoDAN, and ReNeLLM.
2. The amplified GCG attack achieves 91.2% ASR on Llama2-7B/AdvBench, versus 67.9% for the original attack.
3. Reduces generation time to less than a third of previous methods.
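
To put the GCG figures quoted in the abstract (91.2% vs. 67.9% ASR on Llama2-7B/AdvBench) in perspective, the short sketch below works out the absolute and relative gain. It assumes the usual reading of ASR (attack success rate) as the percentage of test prompts on which the attack is judged successful; how success is judged is not stated on this page, and the function names are illustrative only.

```python
def attack_success_rate(successes: int, total: int) -> float:
    """ASR as a percentage: successful attacks divided by attempted prompts.

    Assumption: this is the common definition; the page does not spell it out.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


def percentage_point_gain(new_asr: float, old_asr: float) -> float:
    """Absolute difference between two ASR values, in percentage points."""
    return new_asr - old_asr


def relative_gain(new_asr: float, old_asr: float) -> float:
    """Improvement relative to the old ASR, expressed as a percentage."""
    return 100.0 * (new_asr - old_asr) / old_asr


# Figures taken from the abstract (GCG on Llama2-7B/AdvBench).
original_gcg = 67.9
amplified_gcg = 91.2

points = round(percentage_point_gain(amplified_gcg, original_gcg), 1)
relative = round(relative_gain(amplified_gcg, original_gcg), 1)

print(f"Gain: {points} percentage points ({relative}% relative)")
# prints: Gain: 23.3 percentage points (34.3% relative)
```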

Abstract

Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective jailbreak attacks that manipulate the attention of the model to selectively strengthen or weaken attention among different parts of the prompt. By harnessing attention loss, we develop more effective jailbreak attacks that are also transferable. The attacks amplify the success rate of existing jailbreak algorithms, including GCG, AutoDAN, and ReNeLLM, while lowering their generation cost (for example, the amplified GCG attack achieves 91.2% ASR, vs. 67.9% for the original attack on Llama2-7B/AdvBench, using…
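
As background for the phrase "attention among different parts of the prompt", the toy sketch below measures how much attention one span of token positions places on another under standard causal scaled dot-product attention. It is not the paper's attention loss, whose definition does not appear on this page; random vectors stand in for real model activations, and every name (`causal_attention`, `span_attention`, the span boundaries) is a hypothetical chosen for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys):
    """Scaled dot-product attention weights with a causal mask.

    Token i may only attend to tokens 0..i; later positions get weight 0.
    """
    dim = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights.append(softmax(scores) + [0.0] * (len(keys) - i - 1))
    return weights


def span_attention(weights, src, dst):
    """Mean attention mass that tokens in `src` place on tokens in `dst`."""
    per_token = [sum(weights[i][j] for j in dst) for i in src]
    return sum(per_token) / len(per_token)


# Toy stand-ins for activations: 8 tokens, 4-dimensional queries and keys.
rng = random.Random(0)
n_tokens, dim = 8, 4
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]

weights = causal_attention(queries, keys)
first_half = range(0, 4)   # hypothetical "earlier part" of a prompt
second_half = range(4, 8)  # hypothetical "later part" of a prompt

to_first = span_attention(weights, second_half, first_half)
to_second = span_attention(weights, second_half, second_half)
print(f"later -> earlier: {to_first:.3f}, later -> later: {to_second:.3f}")
```

Because each row of attention weights sums to 1, the two printed values also sum to 1: any attention the later span gives to itself is attention it does not give to the earlier span. A quantity of this kind is one plausible thing a loss could push up or down, but that reading is an assumption rather than something the abstract states.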

Taxonomy

Methods: Softmax · Attention Is All You Need