JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks
Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao

TL;DR
This paper introduces JailBreakV-28K, a comprehensive benchmark dataset to evaluate the robustness of Multimodal Large Language Models against jailbreak attacks, revealing significant vulnerabilities transferred from LLMs.
Contribution
The paper presents JailBreakV-28K, the first benchmark dataset for assessing MLLMs' robustness against jailbreak attacks, covering both text-based and image-based adversarial inputs.
Findings
High attack success rate from transferred LLM jailbreak techniques
MLLMs are vulnerable to both textual and visual jailbreak attacks
Highlights need for improved alignment and robustness in MLLMs
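The findings above are stated in terms of attack success rate (ASR). The sketch below shows one common way to compute such a rate, assuming ASR is simply the fraction of jailbreak attempts that a judge marks as successful. The function name and the example judgements are hypothetical and are not taken from the paper.

```python
from typing import Iterable


def attack_success_rate(judgements: Iterable[bool]) -> float:
    """Return the fraction of attempts judged to be successful jailbreaks.

    Assumption: each judgement is True when the model's response was
    deemed harmful, and False when the model refused or answered safely.
    """
    results = list(judgements)
    if not results:
        raise ValueError("no judgements provided")
    # True counts as 1 and False as 0, so sum() counts the successes.
    return sum(results) / len(results)


# Hypothetical judgements for five attack attempts (illustrative only).
example = [True, False, True, True, False]
print(f"ASR = {attack_success_rate(example):.2f}")  # ASR = 0.60
```

In practice the judgements would come from a human or automated evaluator applied to each model response; how the paper's evaluator works is not described in this summary.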
Abstract
With the rapid advancements in Multimodal Large Language Models (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate an important and unexplored question of whether techniques that successfully jailbreak Large Language Models (LLMs) can be equally effective in jailbreaking MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Utilizing a dataset of 2,000 malicious queries that is also proposed in this paper, we generate 20,000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8,000 image-based jailbreak inputs from recent MLLMs jailbreak attacks, our comprehensive…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Digital and Cyber Forensics · Cybercrime and Law Enforcement Studies
