Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character
Siyuan Ma, Weidi Luo, Yu Wang, Xiaogeng Liu

TL;DR
This paper introduces Visual Role-play (VRP), a novel attack method that uses role-playing images generated by LLMs to effectively deceive multimodal large language models, exposing vulnerabilities in their safety mechanisms.
Contribution
The paper presents VRP, a new, generalizable jailbreak attack leveraging role-play images to mislead MLLMs, outperforming existing methods in attack success rate.
Findings
VRP achieves higher attack success rates than baseline methods.
VRP demonstrates strong generalizability across different models.
Extensive experiments validate VRP's effectiveness and robustness.
Abstract
With the advent and widespread deployment of Multimodal Large Language Models (MLLMs), ensuring their safety has become increasingly critical. Achieving this objective requires proactively discovering the vulnerabilities of MLLMs by exploring attack methods. To that end, structure-based jailbreak attacks, in which harmful semantic content is embedded within images, have been proposed to mislead the models. However, previous structure-based jailbreak methods mainly focus on transforming the format of malicious queries, such as converting harmful content into images through typography, an approach that lacks sufficient jailbreak effectiveness and generalizability. To address these limitations, we first introduce the concept of "Role-play" into MLLM jailbreak attacks and propose a novel and effective method called Visual Role-play (VRP). Specifically, VRP leverages Large Language Models to generate…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling
Methods: Focus
