MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models

Fenghua Weng; Yue Xu; Chengyan Fu; Wenjie Wang

arXiv:2408.08464 · cs.CR · October 23, 2024

TL;DR

This paper introduces MMJ-Bench, a unified evaluation framework and benchmark for assessing jailbreak attacks and defenses on Multimodal Large Language Models, addressing the lack of standardized comparison methods.

Contribution

It provides the first comprehensive, systematic evaluation pipeline and public benchmark for MLLM jailbreak attacks and defenses, enabling consistent comparison across methods.

Findings

01. Attack methods vary in effectiveness against MLLMs.

02. Defense mechanisms impact model utility and security.

03. The benchmark reveals key directions for future research.

Abstract

As deep learning advances, Large Language Models (LLMs) and their multimodal counterparts, Multimodal Large Language Models (MLLMs), have shown exceptional performance in many real-world tasks. However, MLLMs face significant security challenges, such as jailbreak attacks, in which attackers attempt to bypass the model's safety alignment to elicit harmful responses. The threat of jailbreak attacks on MLLMs arises from both the inherent vulnerabilities of LLMs and the multiple information channels that MLLMs process. While various attacks and defenses have been proposed, there is a notable gap in unified and comprehensive evaluation: each method is evaluated on different datasets and metrics, making it impossible to compare their effectiveness. To address this gap, we introduce MMJ-Bench, a unified pipeline for evaluating jailbreak attacks and defense techniques for…

Code & Models

Repositories

thunxxx/MLLM-Jailbreak-evaluation-MMJ-bench
PyTorch · Official

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning