Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character

Siyuan Ma; Weidi Luo; Yu Wang; Xiaogeng Liu

arXiv:2405.20773 · cs.CR · June 13, 2024

TL;DR

This paper introduces Visual Role-play (VRP), a jailbreak attack that uses role-play character images, produced with the help of LLMs, to mislead multimodal large language models (MLLMs) and expose weaknesses in their safety mechanisms.

Contribution

The paper presents VRP, a new, generalizable jailbreak attack leveraging role-play images to mislead MLLMs, outperforming existing methods in attack success rate.

Findings

01 VRP achieves higher attack success rates than baseline methods.

02 VRP demonstrates strong generalizability across different models.

03 Extensive experiments validate VRP's effectiveness and robustness.

Abstract

With the advent and widespread deployment of Multimodal Large Language Models (MLLMs), ensuring their safety has become increasingly critical. To achieve this objective, it requires us to proactively discover the vulnerability of MLLMs by exploring the attack methods. Thus, structure-based jailbreak attacks, where harmful semantic content is embedded within images, have been proposed to mislead the models. However, previous structure-based jailbreak methods mainly focus on transforming the format of malicious queries, such as converting harmful content into images through typography, which lacks sufficient jailbreak effectiveness and generalizability. To address these limitations, we first introduce the concept of "Role-play" into MLLM jailbreak attacks and propose a novel and effective method called Visual Role-play (VRP). Specifically, VRP leverages Large Language Models to generate…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling

Methods: Focus