Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

Zonghao Ying; Aishan Liu; Tianyuan Zhang; Zhengmin Yu; Siyuan Liang; Xianglong Liu; Dacheng Tao

arXiv:2406.04031·cs.CV·July 2, 2024·1 cites

TL;DR

This paper introduces BAP, a novel bi-modal adversarial prompt attack that effectively bypasses safety guardrails in vision-language models by optimizing both visual and textual prompts, revealing safety vulnerabilities.

Contribution

The paper presents a new bi-modal attack method that jointly optimizes visual and textual prompts to successfully jailbreak large vision language models, outperforming existing approaches.

Findings

01. Improves attack success rate by 29.03% on average over existing methods.

02. Effective against black-box commercial LVLMs such as Gemini and ChatGLM.

03. Significantly outperforms previous methods in robustness and success rate.

Abstract

In the realm of large vision language models (LVLMs), jailbreak attacks serve as a red-teaming approach to bypass guardrails and uncover safety implications. Existing jailbreaks predominantly focus on the visual modality, perturbing only the visual inputs in the prompt. However, they fall short when confronted with aligned models that fuse visual and textual features simultaneously for generation. To address this limitation, this paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. Initially, we adversarially embed universally harmful perturbations in an image, guided by a few-shot query-agnostic corpus (e.g., affirmative prefixes and negative inhibitions). This process ensures that the image prompts LVLMs to respond positively to any harmful query. Subsequently, leveraging the adversarial…
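
The first stage described in the abstract, a single image perturbation optimized against a small query-agnostic set of targets, follows the general pattern of a universal adversarial perturbation found by projected gradient steps. The sketch below illustrates only that pattern on a toy linear softmax classifier; it is not the authors' implementation and involves no LVLM. The names `W`, `biases`, `targets`, `eps`, `alpha`, and `steps` are assumptions introduced for illustration: each row of `biases` stands in for a different context, and `targets` stands in for the desired outputs in the corpus.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def corpus_loss_and_grad(W, biases, targets, x_adv):
    """Mean cross-entropy over the corpus and its gradient w.r.t. the input.

    Toy stand-in model (an assumption): logits_i = W @ x_adv + biases[i].
    W: (k, d), biases: (n, k), targets: (n,) int, x_adv: (d,).
    """
    logits = x_adv @ W.T + biases              # (n, k)
    probs = softmax(logits)
    idx = np.arange(len(targets))
    loss = -np.log(probs[idx, targets] + 1e-12).mean()
    probs[idx, targets] -= 1.0                 # softmax minus one-hot
    grad = (probs @ W).mean(axis=0)            # (d,)
    return loss, grad

def universal_perturbation(W, biases, targets, x,
                           eps=8 / 255, alpha=1 / 255, steps=100):
    """Signed-gradient descent on the corpus loss under an L-infinity budget."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        _, grad = corpus_loss_and_grad(W, biases, targets, x + delta)
        delta = delta - alpha * np.sign(grad)       # move toward the targets
        delta = np.clip(delta, -eps, eps)           # project onto the eps-ball
        delta = np.clip(x + delta, 0.0, 1.0) - x    # keep a valid pixel range
    return delta
```

Because one `delta` is shared across every row of the corpus, the update averages the gradient over all contexts instead of fitting a single one, which is what makes the perturbation universal in this toy setting.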

Code & Models

Repositories

NY1024/BAP-Jailbreak-Vision-Language-Models-via-Bi-Modal-Adversarial-Prompt
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Forensic and Genetic Research

Methods: Focus