TL;DR
This paper introduces a novel method called Attention Eclipse that manipulates model attention to craft more effective and transferable jailbreak attacks on large language models, revealing vulnerabilities despite safety measures.
Contribution
It presents a new attention-based attack technique that enhances the success rate and transferability of jailbreaks while reducing their computational cost.
Findings
Amplifies success rates of existing jailbreaks such as GCG, AutoDAN, and ReNeLLM.
Amplified GCG achieves 91.2% ASR on Llama2-7B/AdvBench, versus 67.9% for the original GCG attack.
Reduces generation time to less than a third of previous methods.
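To make the ASR figures concrete, the short sketch below works through the arithmetic. It assumes ASR (attack success rate) means the share of attempts judged successful, which is the usual reading but is not defined on this page. The function names are hypothetical, and the only numbers used are the 91.2% and 67.9% figures quoted in the abstract.

```python
def attack_success_rate(outcomes):
    """Percentage of attempts marked successful (assumed definition of ASR)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return 100.0 * sum(1 for ok in outcomes if ok) / len(outcomes)


def compare_asr(baseline_pct, amplified_pct):
    """Return (absolute gain in percentage points, relative gain as a fraction)."""
    absolute_gain = amplified_pct - baseline_pct
    relative_gain = absolute_gain / baseline_pct
    return absolute_gain, relative_gain


# Figures quoted in the abstract for GCG on Llama2-7B/AdvBench.
abs_gain, rel_gain = compare_asr(67.9, 91.2)
print(f"absolute gain: {abs_gain:.1f} points, relative gain: {rel_gain:.1%}")
```

Under that definition, the reported improvement is about 23.3 percentage points, or roughly a one-third relative increase over the original attack.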
Abstract
Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective jailbreak attacks that manipulate the attention of the model to selectively strengthen or weaken attention among different parts of the prompt. By harnessing attention loss, we develop more effective jailbreak attacks that are also transferable. The attacks amplify the success rate of existing jailbreak algorithms including GCG, AutoDAN, and ReNeLLM, while lowering their generation cost (for example, the amplified GCG attack achieves 91.2% ASR, vs. 67.9% for the original attack on Llama2-7B/AdvBench, using…
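As a neutral illustration of what "attention among different parts of the prompt" can mean, the sketch below measures how much softmax attention the tokens in one span place on another span. This is a generic diagnostic, not the paper's attention loss, which this page does not specify. The function names, the single-head score matrix, and the absence of a causal mask are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def span_attention_mass(score_matrix, query_span, key_span):
    """Average attention weight that query tokens place on the key span.

    score_matrix[q][k] is the raw (pre-softmax) score from query token q
    to key token k for one head; spans are half-open (start, end) ranges.
    No causal mask is applied (an assumption for this toy example).
    """
    q_start, q_end = query_span
    k_start, k_end = key_span
    masses = []
    for q in range(q_start, q_end):
        weights = softmax(score_matrix[q])
        masses.append(sum(weights[k_start:k_end]))
    return sum(masses) / len(masses)


# Toy 4-token prompt with uniform scores: each token attends equally everywhere.
uniform = [[0.0] * 4 for _ in range(4)]
mass = span_attention_mass(uniform, query_span=(2, 4), key_span=(0, 2))
print(mass)
```

With uniform scores, each query token spreads its weight evenly over four keys, so half of the mass lands on the two-token key span. Raising or lowering the scores toward that span moves the value toward 1 or 0.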
Taxonomy
Methods: Softmax · Attention Is All You Need
