Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities
Yuan Xiong, Ziqi Miao, Lijun Li, Chen Qian, Jie Li, Jing Shao

TL;DR
This paper introduces Contextual Image Attack (CIA), a novel image-centric method that exploits visual context to effectively jailbreak multimodal large language models, revealing vulnerabilities in their safety alignment.
Contribution
The paper presents a new image-focused attack approach using multi-agent systems and visualization strategies to embed harmful queries, surpassing prior text-image interaction methods.
Findings
On the MMSafetyBench-tiny dataset, CIA achieves high toxicity scores of 4.73 against GPT-4o and 4.83 against Qwen2.5-VL-72B.
Attack Success Rates reach 86.31% and 91.07%.
CIA outperforms previous methods in exposing safety vulnerabilities.
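As a rough illustration of how the two reported metrics relate, the sketch below aggregates per-sample judge scores into a mean toxicity score and an Attack Success Rate. It is a minimal sketch under stated assumptions, not the paper's evaluation code: the 1-to-5 scoring scale, the success threshold of 5, the function name `summarize_judge_scores`, and the sample scores are all hypothetical and are not given in this summary.

```python
from statistics import mean


def summarize_judge_scores(scores, success_threshold=5):
    """Aggregate per-sample judge scores into two summary metrics.

    Assumptions (not stated in the source): scores lie on a 1-5 scale,
    and a sample counts as a successful attack when its score is at
    least `success_threshold`.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    # Mean toxicity: plain average of the judge scores.
    avg_toxicity = mean(scores)
    # Attack Success Rate: percentage of samples at or above the threshold.
    successes = sum(1 for s in scores if s >= success_threshold)
    asr_percent = 100.0 * successes / len(scores)
    return avg_toxicity, asr_percent


# Hypothetical scores for five samples, for illustration only.
example_scores = [5, 5, 4, 5, 2]
avg, asr = summarize_judge_scores(example_scores)
print(f"mean toxicity = {avg:.2f}, ASR = {asr:.2f}%")
```

Under these assumptions the two numbers can diverge: a response scored 4 raises the mean toxicity but does not count toward the success rate, which is why a summary like the one above reports both.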
Abstract
While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embed harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack's efficacy, the system incorporates contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Hate Speech and Cyberbullying Detection
