Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step

Wenxuan Wang; Kuiyi Gao; Youliang Yuan; Jen-tse Huang; Qiuzhi Liu; Shuai Wang; Wenxiang Jiao; Zhaopeng Tu

arXiv:2410.03869·cs.CL·June 4, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces the Chain-of-Jailbreak attack, a step-by-step method to bypass safety measures in image generation models, and proposes a defense strategy that effectively mitigates this attack, advancing AI safety research.

Contribution

The paper presents a novel multi-step jailbreaking attack for image generation models and a prompt-based defense, significantly improving safety evaluation and mitigation techniques.

Findings

01

CoJ attack bypasses safeguards in over 60% of cases

02

Proposed defense blocks over 95% of CoJ attacks

03

Constructed CoJ-Bench dataset for safety scenario evaluation
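
The two percentages above can be read as simple proportions. The following is a minimal sketch of that arithmetic, assuming the bypass rate is measured over all test queries and the defense rate over attack attempts made with the defense enabled. The function name `rate` and the outcome lists are hypothetical and are not data from the paper.

```python
def rate(outcomes):
    """Return the fraction of True values in a sequence of booleans."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


# Hypothetical per-query outcomes (illustrative only, not results from the paper).
# True means the attack got past the safeguard for that query.
bypassed_without_defense = [True, True, False, True, False]

# True means the defense stopped the attack attempt.
blocked_with_defense = [True, True, True, True]

bypass_rate = rate(bypassed_without_defense)
defense_rate = rate(blocked_with_defense)

print(f"bypass rate: {bypass_rate:.0%}")
print(f"defense rate: {defense_rate:.0%}")
```

Keeping the two denominators separate matters: a bypass rate counts successful attacks, while a defense rate counts blocked ones, so a higher value is worse for the first and better for the second.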

Abstract

Text-based image generation models, such as Stable Diffusion and DALL-E 3, hold significant potential in content creation and publishing workflows, which has made them a focus of attention in recent years. Despite their remarkable capability to generate diverse and vivid images, considerable efforts are being made to prevent the generation of harmful content, such as abusive, violent, or pornographic material. To assess the safety of existing models, we introduce a novel jailbreaking method called Chain-of-Jailbreak (CoJ) attack, which compromises image generation models through a step-by-step editing process. Specifically, for malicious queries that cannot bypass the safeguards with a single prompt, we intentionally decompose the query into multiple sub-queries. The image generation models are then prompted to generate and iteratively edit images based on these sub-queries. To evaluate the effectiveness…
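
The abstract implies that a safeguard which inspects each prompt in isolation can miss content that only emerges once several edits are combined. The sketch below illustrates that gap from the defender's side with a toy text filter. It is an assumption-laden illustration, not the authors' attack or defense: `BLOCKED_PHRASES`, `is_blocked`, `EditSession`, and both check functions are hypothetical, and the blocked phrase is a harmless placeholder.

```python
from dataclasses import dataclass

# Hypothetical placeholder standing in for content a safeguard would refuse.
BLOCKED_PHRASES = {"forbidden phrase"}


def is_blocked(text: str) -> bool:
    """Toy safeguard: flag text containing any blocked phrase (case-insensitive)."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKED_PHRASES)


@dataclass
class EditSession:
    """Tracks the text that a chain of edit requests has accumulated so far."""
    rendered_text: str = ""

    def apply(self, insertion: str) -> None:
        # Append the newly requested text to what is already in the image.
        self.rendered_text = (self.rendered_text + " " + insertion).strip()


def per_turn_check(insertions):
    """Check each edit request on its own, ignoring earlier turns."""
    return [is_blocked(text) for text in insertions]


def cumulative_check(insertions):
    """Check the accumulated result after every edit request."""
    session = EditSession()
    flags = []
    for text in insertions:
        session.apply(text)
        flags.append(is_blocked(session.rendered_text))
    return flags


steps = ["forbidden", "phrase"]
print(per_turn_check(steps))    # [False, False]
print(cumulative_check(steps))  # [False, True]
```

In this toy setting neither edit request is flagged alone, but the accumulated state is flagged at the second step, which is why a check over the whole editing history behaves differently from a per-prompt check.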

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Innovative Methodology: The introduction of the Chain-of-Jailbreak (CoJ) attack is a significant advancement. By decomposing malicious queries into harmless sub-queries and using iterative editing, the paper presents a novel approach to bypassing safeguards in text-based image generation models.

Comprehensive Evaluation: The authors have conducted extensive experiments across multiple models (GPT-4V, GPT-4o, Gemini 1.5, and Gemini 1.5 Pro) and scenarios. This thorough evaluation demonstrates th

Weaknesses

Method Robustness: The authors propose using Edit Operations and Edit Elements to break down a malicious query into sub-queries. In the implementation, they manually apply this approach to five seed queries and leverage a large language model (LLM) to generalize these examples to other queries. However, my concern is whether the model consistently adheres to the principles of the proposed Edit Operations and Edit Elements. It would be helpful if the authors could elaborate on the reliability of

Reviewer 02 · Rating 5 · Confidence 5

Strengths

Safety of text-to-image models is an important topic. A method is proposed to generate a specific type of harmful images. A benchmark is collected.

Weaknesses

The scope is somewhat limited. Only images containing harmful words can be generated by this method. This should be made clear in the paper's title and abstract. To generate such images, a more straightforward approach may be applicable, e.g., directly merging harmful words with images. Note that an attacker who jailbreaks a GenAI model to generate harmful images still needs to propagate them to cause real harm to other people. To generate the types of harmful images considered in this work, an atta

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. The paper proposes a novel yet simple approach to bypass the safeguards of image generation models, which are known to be stronger than those of open-source models such as Stable Diffusion. It also substantially outperforms previous jailbreaking methods in text-to-image generation.
2. The paper also introduces edit operations, so that the decomposition process is structured systematically rather than simply tokenizing the input prompt. The diversity of the attack method showed the effect

Weaknesses

1. The defense method is not clearly explained and not realistic enough. How and when are the defense prompts input to the image generation model?
2. A threat model for this type of attack is not specified. When would these attacks be used in a real-life scenario?

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis

Methods: Focus · Diffusion