JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, and Minlie Huang

arXiv:2508.05087·cs.MM·August 8, 2025

TL;DR

JPS introduces a novel method combining visual perturbations and textual steering to effectively jailbreak multimodal large language models, significantly improving attack success and malicious intent fulfillment rates.

Contribution

The paper proposes JPS, a new collaborative visual and textual attack framework that enhances jailbreak success and malicious response quality in multimodal large language models.

Findings

01. JPS achieves state-of-the-art attack success rates across various benchmarks.

02. The Malicious Intent Fulfillment Rate (MIFR) metric effectively measures attack quality.

03. Iterative co-optimization of visual and textual components improves attack performance.

Abstract

Jailbreak attacks against multimodal large language models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS (Jailbreak MLLMs with collaborative visual Perturbation and textual Steering), which achieves jailbreaks through the cooperation of an adversarial image and a textual steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by a "steering prompt" optimized via a multi-agent system to guide LLM responses toward fulfilling the attacker's intent. These visual and…
