Multimodal Pragmatic Jailbreak on Text-to-image Models
Tong Liu, Zhixin Lai, Jiawen Wang, Gengyuan Zhang, Shuo Chen, Philip Torr, Vera Demberg, Volker Tresp, Jindong Gu

TL;DR
This paper uncovers a new type of jailbreak in text-to-image models where combining safe text and images can produce unsafe content, revealing vulnerabilities in current diffusion models and filters.
Contribution
It introduces a systematic dataset and benchmark for multimodal jailbreaks in T2I models, highlighting their vulnerability and evaluating filter defenses.
Findings
All tested models are susceptible to the jailbreak, with unsafe rates up to 70%.
Common filters fail to detect or prevent the jailbreak.
The jailbreak exploits the models' text rendering and training data biases.
Abstract
Diffusion models have recently achieved remarkable advancements in terms of image quality and fidelity to textual prompts. Concurrently, the safety of such generative models has become an area of growing concern. This work introduces a novel type of jailbreak, which triggers T2I models to generate the image with visual text, where the image and the text, although considered to be safe in isolation, combine to form unsafe content. To systematically explore this phenomenon, we propose a dataset to evaluate the current diffusion-based text-to-image (T2I) models under such jailbreak. We benchmark nine representative T2I models, including two closed-source commercial models. Experimental results reveal a concerning tendency to produce unsafe content: all tested models suffer from such type of jailbreak, with rates of unsafe generation ranging from around 10\% to 70\% where DALLE 3…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsLaw in Society and Culture
