Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Shiji Zhao; Ranjie Duan; Fengxiang Wang; Chi Chen; Caixin Kang; Shouwei Ruan; Jialing Tao; YueFeng Chen; Hui Xue; Xingxing Wei

arXiv:2501.04931 · cs.CR · June 30, 2025

TL;DR

This paper identifies a Shuffle Inconsistency in multimodal large language models (MLLMs): a discrepancy between how well the models comprehend shuffled harmful instructions and how well their safety mechanisms handle them. It exploits this discrepancy with a new jailbreak attack, SI-Attack, which significantly increases attack success rates on commercial models.

Contribution

The paper characterizes Shuffle Inconsistency in MLLMs and proposes SI-Attack, a jailbreak method based on black-box optimization, to bypass their safety mechanisms.

Findings

01. SI-Attack improves attack success rates on multiple benchmarks.

02. SI-Attack is significantly more effective at bypassing the safety mechanisms of commercial MLLMs such as GPT-4o and Claude-3.5-Sonnet.

03. Shuffle Inconsistency can be exploited to build more effective jailbreak attacks.

Abstract

Multimodal Large Language Models (MLLMs) have achieved impressive performance and have been put into practical use in commercial applications, but their safety mechanisms still have potential vulnerabilities. Jailbreak attacks are red-teaming methods that aim to bypass safety mechanisms and uncover the potential risks of MLLMs. Existing MLLM jailbreak methods often bypass the model's safety mechanism through complex optimization or carefully designed image and text prompts. Despite some progress, they achieve a low attack success rate on commercial closed-source MLLMs. Unlike previous research, we empirically find that there exists a Shuffle Inconsistency between MLLMs' comprehension ability and safety ability for shuffled harmful instructions. That is, from the perspective of comprehension ability, MLLMs can understand shuffled harmful text-image instructions well.…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis