PolyJailbreak: Cross-Modal Jailbreaking Attacks on Black-Box Multimodal LLMs
Xinkai Wang, Beibei Li, Zerui Shao, Ao Liu, Guangquan Xu, Shouling Ji

TL;DR
This paper shows that multimodal large language models (MLLMs) are vulnerable to black-box jailbreak attacks. It introduces PolyJailbreak, a structured, multi-agent attack framework, and demonstrates its effectiveness across various models, exposing weaknesses in their safety alignment.
Contribution
The paper presents PolyJailbreak, a novel structured framework for black-box multimodal jailbreak attacks, leveraging a library of primitives and reinforcement learning to improve attack success rates.
Findings
PolyJailbreak achieves an attack success rate of over 95% on commercial models.
It outperforms existing jailbreak methods by 18.15% on average.
Visual inputs can disrupt cross-modal safety constraints.
Abstract
Multimodal large language models (MLLMs) have become integral to a wide range of real-world applications by jointly reasoning over text and visual inputs. However, despite recent advances in safety alignment, MLLMs remain vulnerable to jailbreak attacks, where carefully crafted inputs can bypass safety mechanisms and elicit harmful responses. In this work, we investigate the security vulnerabilities of MLLMs in text-vision scenarios and propose a novel black-box jailbreak framework, named PolyJailbreak. We first identify a phenomenon, termed multimodal safety asymmetry, where visual alignment introduces uneven safety constraints across modalities and weakens overall robustness. We analyze attention dynamics and latent representations in MLLMs, revealing that visual inputs can disrupt cross-modal information flow and reduce the model's ability to separate benign and malicious intents.…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Topic Modeling
