Enhanced MLLM Black-Box Jailbreaking Attacks and Defenses

Xingwei Zhong; Kar Wai Fok; Vrizlynn L.L. Thing

arXiv:2510.21214 · cs.CR · October 27, 2025

TL;DR

This paper introduces advanced black-box jailbreak techniques for multimodal large language models, combining text and image prompts, and proposes improved defense strategies to enhance security against such attacks.

Contribution

It presents novel jailbreak methods involving both text and image prompts and develops new defense strategies for training and inference to counter these attacks.

Findings

01. Jailbreak methods successfully bypass existing defenses
02. New defense strategies improve protection during training and inference
03. Enhanced evaluation framework for multimodal model security

Abstract

Multimodal large language models (MLLMs) comprise both visual and textual modalities to process vision language tasks. However, MLLMs are vulnerable to security-related issues, such as jailbreak attacks that alter the model's input to induce unauthorized or harmful responses. The incorporation of the additional visual modality introduces new dimensions to security threats. In this paper, we proposed a black-box jailbreak method via both text and image prompts to evaluate MLLMs. In particular, we designed text prompts with provocative instructions, along with image prompts that introduced mutation and multi-image capabilities. To strengthen the evaluation, we also designed a Re-attack strategy. Empirical results show that our proposed work can improve capabilities to assess the security of both open-source and closed-source MLLMs. With that, we identified gaps in existing defense…