Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step
Wenxuan Wang, Kuiyi Gao, Youliang Yuan, Jen-tse Huang, Qiuzhi Liu, Shuai Wang, Wenxiang Jiao, Zhaopeng Tu

TL;DR
This paper introduces the Chain-of-Jailbreak attack, a step-by-step method to bypass safety measures in image generation models, and proposes a defense strategy that effectively mitigates this attack, advancing AI safety research.
Contribution
The paper presents a novel multi-step jailbreaking attack for image generation models and a prompt-based defense, significantly improving safety evaluation and mitigation techniques.
Findings
CoJ attack bypasses safeguards in over 60% of cases
Proposed defense blocks over 95% of CoJ attacks
Constructed CoJ-Bench dataset for safety scenario evaluation
Abstract
Text-based image generation models, such as Stable Diffusion and DALL-E 3, hold significant potential in content creation and publishing workflows, making them a focus of research in recent years. Despite their remarkable capability to generate diverse and vivid images, considerable efforts are being made to prevent the generation of harmful content, such as abusive, violent, or pornographic material. To assess the safety of existing models, we introduce a novel jailbreaking method called the Chain-of-Jailbreak (CoJ) attack, which compromises image generation models through a step-by-step editing process. Specifically, for malicious queries that cannot bypass the safeguards with a single prompt, we intentionally decompose the query into multiple sub-queries. The image generation models are then prompted to generate and iteratively edit images based on these sub-queries. To evaluate the effectiveness…
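The decomposition idea in the abstract can be illustrated from the defender's side with a toy sketch. It is not the paper's method or its proposed defense. It only shows, under stated assumptions, why a filter that inspects each edit instruction in isolation can miss content that becomes visible only once the edits are composed. The names `BLOCKLIST`, `EditStep`, `per_step_check`, and `cumulative_check` are hypothetical, the blocklisted word is a harmless placeholder, and the substring match stands in for a real safety classifier.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder blocklist; a real system would use a trained
# safety classifier rather than substring matching.
BLOCKLIST = {"forbidden"}


def is_flagged(text: str) -> bool:
    """Toy safety check: flag text that contains any blocklisted term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


@dataclass
class EditStep:
    """One edit instruction (illustrative subset of operations)."""
    operation: str  # "insert" or "delete"
    content: str


def apply_step(state: str, step: EditStep) -> str:
    """Apply a single edit to the current text state."""
    if step.operation == "insert":
        return state + step.content
    if step.operation == "delete":
        return state.replace(step.content, "", 1)
    raise ValueError(f"unknown operation: {step.operation}")


def per_step_check(steps: List[EditStep]) -> bool:
    """Return True if any single instruction is flagged in isolation."""
    return any(is_flagged(step.content) for step in steps)


def cumulative_check(steps: List[EditStep]) -> bool:
    """Return True if the composed result is flagged after any step."""
    state = ""
    for step in steps:
        state = apply_step(state, step)
        if is_flagged(state):
            return True
    return False


# Two fragments that are individually unremarkable but compose
# into the placeholder blocklisted term.
steps = [EditStep("insert", "for"), EditStep("insert", "bidden")]
```

In this sketch `per_step_check(steps)` does not flag either fragment, while `cumulative_check(steps)` flags the composed text. The design point is only that the check follows the accumulated state rather than the latest instruction. How the paper's own defense is implemented is not described in the excerpt above.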
Peer Reviews
Decision·Submitted to ICLR 2025
Innovative Methodology: The introduction of the Chain-of-Jailbreak (CoJ) attack is a significant advancement. By decomposing malicious queries into harmless sub-queries and using iterative editing, the paper presents a novel approach to bypassing safeguards in text-based image generation models. Comprehensive Evaluation: The authors have conducted extensive experiments across multiple models (GPT-4V, GPT-4o, Gemini 1.5, and Gemini 1.5 Pro) and scenarios. This thorough evaluation demonstrates th
Method Robustness: The authors propose using Edit Operations and Edit Elements to break down a malicious query into sub-queries. In the implementation, they manually apply this approach to five seed queries and leverage a large language model (LLM) to generalize these examples to other queries. However, my concern is whether the model consistently adheres to the principles of the proposed Edit Operations and Edit Elements. It would be helpful if the authors could elaborate on the reliability of
Safety of text-to-image models is an important topic. A method is proposed to generate a specific type of harmful images. A benchmark is collected.
The scope is somewhat limited: only images containing harmful words can be generated by this method. This should be made clear in the paper title and abstract. To generate such images, a more straightforward approach may be applicable, e.g., directly merging harmful words with images. Note that an attacker who jailbreaks a GenAI model to generate harmful images still needs to propagate them to cause real harm to other people. To generate the types of harmful images considered in this work, an atta
1. The paper proposes a novel yet simple approach to bypass image generation model safeguards, which are known to be stronger than those of open-source models such as Stable Diffusion. It also substantially outperforms previous jailbreaking methods in text-to-image generation. 2. The edit operation introduced in the decomposition process is structured systematically rather than simply tokenizing the input prompt. The diversity of the attack method showed the effect
1. The defense method is not clearly explained and not realistic enough. How and when are the defense prompts input to the image generation model? 2. A threat model for this type of attack is not specified. When would these attacks be used in a real-life scenario?
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis
Methods: Focus · Diffusion
