Transferable Adversarial Attacks on Black-Box Vision-Language Models
Kai Hu, Weichen Yu, Li Zhang, Alexander Robey, Andy Zou, Chengming Xu, Haoqi Hu, Matt Fredrikson

TL;DR
This paper demonstrates that black-box vision-language models like GPT-4o, Claude, and Gemini are highly vulnerable to transferable targeted adversarial attacks, which can manipulate their visual interpretations and responses.
Contribution
It provides a comprehensive analysis of adversarial transferability to proprietary VLLMs and introduces universal perturbations that can induce specific misinterpretations across multiple models.
Findings
Targeted adversarial examples transfer effectively to proprietary VLLMs.
Universal perturbations can induce consistent misinterpretations across models.
Vulnerabilities are prevalent in state-of-the-art vision-language models.
Abstract
Vision Large Language Models (VLLMs) are increasingly deployed to offer advanced capabilities on inputs comprising both text and images. While prior research has shown that adversarial attacks can transfer from open-source to proprietary black-box models in text-only and vision-only contexts, the extent and effectiveness of such vulnerabilities remain underexplored for VLLMs. We present a comprehensive analysis demonstrating that targeted adversarial examples are highly transferable to widely-used proprietary VLLMs such as GPT-4o, Claude, and Gemini. We show that attackers can craft perturbations to induce specific attacker-chosen interpretations of visual information, such as misinterpreting hazardous content as safe, overlooking sensitive or restricted material, or generating detailed incorrect responses aligned with the attacker's intent. Furthermore, we discover that universal…
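As a rough illustration of what a "targeted" perturbation means, the sketch below runs a projected-gradient attack against a toy linear-softmax surrogate. It is not the paper's method: the surrogate `(W, b)`, the function name `targeted_pgd`, and the values of `eps`, `step`, and `iters` are all assumptions chosen for illustration, and nothing here shows transfer to a black-box model.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def targeted_pgd(x, W, b, target, eps=0.1, step=0.01, iters=200):
    """Targeted L-inf PGD against a hypothetical linear-softmax surrogate.

    x      : flattened image with values in [0, 1]
    W, b   : surrogate weights (num_classes x dim) and bias (assumed, toy model)
    target : index of the attacker-chosen class
    eps    : maximum per-pixel change (L-inf budget)
    """
    x_adv = x.copy()
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    for _ in range(iters):
        p = softmax(W @ x_adv + b)
        # Gradient of cross-entropy(target) w.r.t. the input for this model.
        grad = W.T @ (p - onehot)
        # Step against the gradient sign to lower the loss on the target class.
        x_adv = x_adv - step * np.sign(grad)
        # Project back into the eps-ball around the original image.
        x_adv = np.clip(x_adv, x - eps, x + eps)
        # Keep pixel values in the valid range.
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 64))
    b = np.zeros(3)
    x = rng.uniform(size=64)
    target = int(np.argmin(softmax(W @ x + b)))  # least likely class
    x_adv = targeted_pgd(x, W, b, target)
    print("target prob before:", softmax(W @ x + b)[target])
    print("target prob after: ", softmax(W @ x_adv + b)[target])
```

In a transfer setting, the surrogate would be one or more open-source models rather than a linear map, and the perturbed image would then be submitted to the black-box model; that step is outside this sketch.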
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Sparse Evolutionary Training
