Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters

Ahmed B Mustafa; Zihan Ye; Yang Lu; Michael P Pound; Shreyank N Gowda

arXiv:2604.01888 · cs.CV · April 3, 2026

TL;DR

This paper shows that current text-to-image safety filters are vulnerable to simple prompt-based jailbreak attacks that bypass safeguards with high success rates, exposing a gap in how the filters interpret semantic context.

Contribution

The authors systematically categorize and evaluate low-effort prompt-based jailbreak techniques that bypass safety filters in state-of-the-art text-to-image models.

Findings

01. Attack success rate up to 74.47% across models

02. Simple linguistic modifications can reliably evade safeguards

03. Proposed taxonomy of visual jailbreak strategies

Abstract

Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy-violating content. In this work, we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks…
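The abstract names five strategy categories, and the findings report an attack success rate as a percentage. As a rough illustration of how such a metric is commonly tallied per category, the sketch below assumes that attack success rate means the share of attempts judged to have bypassed the filter. The paper's exact evaluation protocol is not given in this excerpt, so the `Strategy` enum, both function names, and the toy outcomes are hypothetical and are not taken from the authors' code or results.

```python
from collections import defaultdict
from enum import Enum


class Strategy(Enum):
    """The five categories listed in the abstract's taxonomy."""
    ARTISTIC_REFRAMING = "artistic reframing"
    MATERIAL_SUBSTITUTION = "material substitution"
    PSEUDO_EDUCATIONAL_FRAMING = "pseudo-educational framing"
    LIFESTYLE_AESTHETIC_CAMOUFLAGE = "lifestyle aesthetic camouflage"
    AMBIGUOUS_ACTION_SUBSTITUTION = "ambiguous action substitution"


def attack_success_rate(outcomes):
    """Return the percentage of attempts marked as bypassing the filter.

    Assumption: ASR = 100 * successes / attempts.
    """
    outcomes = [bool(o) for o in outcomes]
    if not outcomes:
        raise ValueError("need at least one attempt")
    return 100.0 * sum(outcomes) / len(outcomes)


def asr_by_strategy(records):
    """Group (strategy, bypassed) pairs and compute the ASR per strategy."""
    grouped = defaultdict(list)
    for strategy, bypassed in records:
        grouped[strategy].append(bypassed)
    return {s: attack_success_rate(v) for s, v in grouped.items()}


# Hypothetical toy outcomes for illustration only, NOT data from the paper.
toy_records = [
    (Strategy.ARTISTIC_REFRAMING, True),
    (Strategy.ARTISTIC_REFRAMING, False),
    (Strategy.MATERIAL_SUBSTITUTION, True),
    (Strategy.MATERIAL_SUBSTITUTION, True),
]
toy_asr = asr_by_strategy(toy_records)

for strategy, rate in toy_asr.items():
    print(f"{strategy.value}: {rate:.2f}%")
```

Keeping the outcome labels separate from the aggregation means the same tally could be applied whether the bypass judgment comes from human annotators or an automated classifier, which this excerpt does not specify.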
