JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

Weidi Luo; Siyuan Ma; Xiaogeng Liu; Xiaoyu Guo; Chaowei Xiao

arXiv:2404.03027·cs.CR·November 26, 2024·5 cites

TL;DR

This paper introduces JailBreakV-28K, a comprehensive benchmark dataset to evaluate the robustness of Multimodal Large Language Models against jailbreak attacks, revealing significant vulnerabilities transferred from LLMs.

Contribution

The paper presents JailBreakV-28K, the first benchmark dataset for assessing MLLMs' robustness against jailbreak attacks, including both text and image-based adversarial inputs.

Findings

01 High attack success rate from transferred LLM jailbreak techniques

02 MLLMs are vulnerable to both textual and visual jailbreak attacks

03 Highlights need for improved alignment and robustness in MLLMs

Abstract

With the rapid advancements in Multimodal Large Language Models (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate an important and unexplored question of whether techniques that successfully jailbreak Large Language Models (LLMs) can be equally effective in jailbreaking MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Utilizing a dataset of 2,000 malicious queries that is also proposed in this paper, we generate 20,000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8,000 image-based jailbreak inputs from recent MLLMs jailbreak attacks, our comprehensive…
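
The counts in the abstract and the "attack success rate" mentioned in the findings can be illustrated with a short sketch. This is not the authors' evaluation code: the function names are hypothetical, and it assumes the attack success rate is the fraction of test cases whose model response is judged harmful, which the text above does not spell out.

```python
# Counts as stated in the abstract.
BASE_QUERIES = 2_000        # malicious queries
TEXT_JAILBREAKS = 20_000    # text-based jailbreak prompts
IMAGE_JAILBREAKS = 8_000    # image-based jailbreak inputs


def total_cases() -> int:
    """Total jailbreak test cases: text-based plus image-based."""
    return TEXT_JAILBREAKS + IMAGE_JAILBREAKS


def attack_success_rate(judgements: list[bool]) -> float:
    """Fraction of responses judged harmful (True = attack succeeded).

    Assumption: one boolean per test case, produced by some external
    judge (human or model) that is outside the scope of this sketch.
    """
    if not judgements:
        raise ValueError("need at least one judgement")
    return sum(judgements) / len(judgements)


if __name__ == "__main__":
    print(total_cases())
    print(attack_success_rate([True, False, True, False]))
```

The total of 28,000 matches the "28K" in the benchmark's name; how a response is judged harmful is left to whatever judge the evaluator chooses.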

Code & Models

Repositories

EddyLuo1232/JailBreakV_28K
pytorch

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Digital and Cyber Forensics · Cybercrime and Law Enforcement Studies