White-box Multimodal Jailbreaks Against Large Vision-Language Models

Ruofan Wang; Xingjun Ma; Hanxu Zhou; Chuanjun Ji; Guangnan Ye; Yu-Gang Jiang

arXiv:2405.17894 · cs.CV · October 15, 2024

TL;DR

This paper introduces a dual-modality attack method that jointly manipulates images and text to exploit vulnerabilities in large vision-language models, successfully bypassing defenses and generating harmful content.

Contribution

The paper presents a universal attack strategy, the Universal Master Key (UMK), that jailbreaks VLMs by jointly optimizing adversarial images and text, exposing critical robustness weaknesses.

Findings

01. Achieves a 96% success rate in jailbreaking MiniGPT-4.

02. Demonstrates vulnerability of VLMs to combined image-text adversarial attacks.

03. Highlights the need for improved alignment and robustness strategies.

Abstract

Recent advancements in Large Vision-Language Models (VLMs) have underscored their superiority in various multimodal tasks. However, the adversarial robustness of VLMs has not been fully explored. Existing methods mainly assess robustness through unimodal adversarial attacks that perturb images, while assuming inherent resilience against text-based attacks. Different from existing attacks, in this work we propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within VLMs. Specifically, we propose a dual optimization objective aimed at guiding the model to generate affirmative responses with high toxicity. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input, thus imbuing the image with toxic semantics.…

Code & Models

Repositories

roywang021/UMK (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques