Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking
Jingru Li, Wei Ren, Tianqing Zhu

TL;DR
This paper introduces Attention-Guided Visual Jailbreaking, an adversarial image-perturbation method that suppresses a large vision-language model's attention to its safety instructions, raising attack success on Qwen-VL from 68.8% to 94.4% with 40% fewer iterations.
Contribution
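The paper adds two auxiliary attention objectives to the usual image-perturbation attack: suppressing attention to alignment-relevant prefix tokens and anchoring generation on adversarial image features. Together they circumvent safety alignment rather than overpower it, which reduces gradient conflict during optimization.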
Findings
Achieves a 94.4% attack success rate on Qwen-VL, versus 68.8% for the baseline.
Reduces gradient conflict by 45% and converges with 40% fewer iterations.
Identifies a failure mode called safety blindness, in which the model's attention to its safety instructions is suppressed.
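This summary does not say how gradient conflict is measured. One common choice in the multi-task learning literature is the cosine similarity between the gradients of two objectives, with a negative value indicating that they pull in opposing directions. The sketch below illustrates that generic metric on plain Python lists; the function names are hypothetical and this is an assumption, not necessarily the paper's definition.

```python
import math

def cosine_similarity(g1, g2):
    """Cosine of the angle between two gradient vectors of equal length."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # a zero gradient cannot conflict with anything
    return dot / (n1 * n2)

def gradient_conflict(g1, g2):
    """Assumed conflict score in [0, 1]: 0 when the gradients do not oppose
    each other, 1 when they point in exactly opposite directions."""
    return max(0.0, -cosine_similarity(g1, g2))

# Toy gradients: the second objective directly opposes the first.
print(gradient_conflict([1.0, 0.0], [-1.0, 0.0]))
```

Under this reading, a 45% reduction would mean the averaged score drops to a little over half of its baseline value across optimization steps.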
Abstract
Large Vision-Language Models (LVLMs) rely on attention-based retrieval of safety instructions to maintain alignment during generation. Existing attacks typically optimize image perturbations to maximize harmful output likelihood, but suffer from slow convergence due to gradient conflict between adversarial objectives and the model's safety-retrieval mechanism. We propose Attention-Guided Visual Jailbreaking, which circumvents rather than overpowers safety alignment by directly manipulating attention patterns. Our method introduces two simple auxiliary objectives: (1) suppressing attention to alignment-relevant prefix tokens and (2) anchoring generation on adversarial image features. This simple yet effective push-pull formulation reduces gradient conflict by 45% and achieves 94.4% attack success rate on Qwen-VL (vs. 68.8% baseline) with 40% fewer iterations. At tighter perturbation…
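The two quantities named in the abstract can be read as attention mass on two groups of key positions: the alignment-relevant prefix tokens and the image tokens. The following is a minimal diagnostic sketch of that reading on a toy attention matrix. The function names, the index sets, and the averaging over query rows are assumptions made for illustration; the abstract does not specify which layers or heads are used, how the terms are weighted, or how they enter the optimization.

```python
def attention_mass(attn_rows, key_indices):
    """Average, over query rows, of the attention weight placed on key_indices.

    attn_rows: list of rows, each a softmax-normalized distribution over keys.
    key_indices: positions of the token group of interest (assumed known).
    """
    idx = set(key_indices)
    per_row = [sum(w for j, w in enumerate(row) if j in idx) for row in attn_rows]
    return sum(per_row) / len(per_row)

def push_pull_terms(attn_rows, prefix_idx, image_idx):
    """Return (prefix_mass, image_mass) for one attention map.

    prefix_mass is the quantity the abstract describes as suppressed;
    image_mass is the quantity it describes as the anchor.
    """
    return attention_mass(attn_rows, prefix_idx), attention_mass(attn_rows, image_idx)

# Toy map: 2 generated tokens attending over 4 keys.
# Key 0 stands in for a safety-prefix token, keys 2-3 for image tokens.
attn = [
    [0.5, 0.25, 0.25, 0.0],
    [0.25, 0.25, 0.5, 0.0],
]
prefix_mass, image_mass = push_pull_terms(attn, prefix_idx=[0], image_idx=[2, 3])
print(prefix_mass, image_mass)
```

Tracking the prefix mass over generation steps is also a simple way to monitor for the safety blindness failure mode: a value near zero means the generated tokens are barely attending to the safety instructions.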
