AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation

Zijun Wang; Haoqin Tu; Jieru Mei; Bingchen Zhao; Yisen Wang; Cihang Xie

arXiv:2410.09040·cs.CL·October 14, 2024

TL;DR

This paper introduces AttnGCG, a novel attention manipulation technique that significantly improves jailbreaking attack success rates on large language models by targeting their internal attention mechanisms.

Contribution

We propose AttnGCG, an attention score manipulation method that enhances the effectiveness of jailbreaking attacks on LLMs, demonstrating improved success and transferability.

Findings

01

AttnGCG raises attack success rates by an average of ~7% on the Llama-2 series and ~10% on the Gemma series.

02

The method improves attack transferability to unseen goals and black-box models like GPT-3.5 and GPT-4.

03

Attention visualization provides better interpretability of the attack mechanism.
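
As a rough illustration of the attention view behind this finding, the sketch below measures how one row of attention weights is split across prompt segments (system prompt, goal, adversarial suffix). It is a minimal sketch under stated assumptions, not the authors' code: the function name `attention_share`, the segment boundaries, and the toy weights are all hypothetical, and the paper's actual visualization may aggregate attention differently.

```python
from typing import Dict, List, Tuple


def attention_share(
    weights: List[float],
    segments: Dict[str, Tuple[int, int]],
) -> Dict[str, float]:
    """Return the fraction of total attention mass inside each segment.

    `weights` is one row of attention weights (one value per prompt token).
    `segments` maps a segment name to a half-open token span [start, end).
    Both are illustrative inputs, not a format defined by the paper.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    return {
        name: sum(weights[start:end]) / total
        for name, (start, end) in segments.items()
    }


# Toy attention row over six prompt tokens (hypothetical numbers).
weights = [0.30, 0.20, 0.10, 0.10, 0.20, 0.10]

# Hypothetical token spans for each part of the prompt.
segments = {"system": (0, 2), "goal": (2, 4), "suffix": (4, 6)}

shares = attention_share(weights, segments)
for name, share in shares.items():
    print(f"{name}: {share:.2f}")
```

In this toy row, half of the attention mass falls on the system prompt. The abstract's observation is that attacks tend to be less effective when that share is high.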

Abstract

This paper studies the vulnerabilities of transformer-based Large Language Models (LLMs) to jailbreaking attacks, focusing specifically on the optimization-based Greedy Coordinate Gradient (GCG) strategy. We first observe a positive correlation between the effectiveness of attacks and the internal behaviors of the models. For instance, attacks tend to be less effective when models pay more attention to system prompts designed to ensure LLM safety alignment. Building on this discovery, we introduce an enhanced method that manipulates models' attention scores to facilitate LLM jailbreaking, which we term AttnGCG. Empirically, AttnGCG shows consistent improvements in attack efficacy across diverse LLMs, achieving an average increase of ~7% in the Llama-2 series and ~10% in the Gemma series. Our strategy also demonstrates robust attack transferability against both unseen harmful goals and…

Code & Models

Repositories

ucsc-vlaa/attngcg-attack
none · Official

Taxonomy

Topics: Digital and Cyber Forensics · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques

Methods: Attention Dropout · Attention Is All You Need · Linear Layer · Weight Decay · Label Smoothing · Position-Wise Feed-Forward Layer · Cosine Annealing