Enhanced MLLM Black-Box Jailbreaking Attacks and Defenses
Xingwei Zhong, Kar Wai Fok, Vrizlynn L.L. Thing

TL;DR
This paper introduces advanced black-box jailbreak techniques for multimodal large language models, combining text and image prompts, and proposes improved defense strategies to enhance security against such attacks.
Contribution
It presents novel jailbreak methods involving both text and image prompts and develops new defense strategies for training and inference to counter these attacks.
Findings
The proposed jailbreak methods successfully bypass existing defenses
The new defense strategies improve protection during training and inference
The enhanced evaluation framework better assesses multimodal model security
Abstract
Multimodal large language models (MLLMs) combine visual and textual modalities to process vision-language tasks. However, MLLMs are vulnerable to security issues such as jailbreak attacks, which alter the model's input to induce unauthorized or harmful responses. The additional visual modality introduces new dimensions to these security threats. In this paper, we proposed a black-box jailbreak method that uses both text and image prompts to evaluate MLLMs. In particular, we designed text prompts with provocative instructions, along with image prompts that introduced mutation and multi-image capabilities. To strengthen the evaluation, we also designed a Re-attack strategy. Empirical results show that our proposed method improves the ability to assess the security of both open-source and closed-source MLLMs. With that, we identified gaps in existing defense…
