SteerDiff: Steering towards Safe Text-to-Image Diffusion Models

Hongxiang Zhang; Yifeng He; Hao Chen

arXiv:2410.02710 · cs.CV · October 7, 2025

Open Access · 4 Reviews

TL;DR

SteerDiff introduces a lightweight adaptor that manipulates text embeddings to steer diffusion models away from generating inappropriate content, enhancing safety without sacrificing usability.

Contribution

It presents SteerDiff, a novel method for concept unlearning and safety enforcement in text-to-image diffusion models through embedding manipulation.
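This summary does not say how the adaptor actually edits the text embeddings. Purely as an illustration of what "embedding manipulation" could look like, and not as the authors' implementation, the sketch below assumes a single hypothetical "unsafe" direction in embedding space. Token embeddings whose projection onto that direction exceeds a threshold are flagged, and the offending component is projected out. The function name `steer_embeddings`, the `threshold` parameter, and the toy vectors are all assumptions made for this example.

```python
import numpy as np

def steer_embeddings(token_emb, unsafe_dir, threshold=0.5):
    """Project flagged token embeddings off a hypothetical unsafe direction.

    token_emb:  array of shape (num_tokens, dim), one row per prompt token.
    unsafe_dir: array of shape (dim,), an assumed "unsafe concept" direction.
    threshold:  tokens whose projection score exceeds this value are steered.
    """
    u = unsafe_dir / np.linalg.norm(unsafe_dir)   # unit vector
    scores = token_emb @ u                        # projection length per token
    flagged = scores > threshold                  # boolean mask of unsafe tokens
    steered = token_emb.copy()
    # Subtract the component along u for flagged tokens only.
    steered[flagged] -= np.outer(scores[flagged], u)
    return steered, flagged

# Toy 3-dimensional "embeddings" (illustrative values, not real model outputs).
emb = np.array([[1.0, 0.0, 0.0],
                [0.0, 2.0, 1.0],
                [0.2, 0.1, 0.0]])
unsafe = np.array([0.0, 1.0, 0.0])

steered, flagged = steer_embeddings(emb, unsafe)
print(flagged)   # which tokens were steered
print(steered)   # embeddings that would be passed on to the diffusion model
```

In this toy setup only the second token is modified. A real system would need a learned detector and steering direction rather than a hand-picked vector.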

Findings

1. Effective in concept unlearning tasks
2. Robust against red-teaming strategies
3. Versatile for concept forgetting

Abstract

Text-to-image (T2I) diffusion models have drawn attention for their ability to generate high-quality images with precise text alignment. However, these models can also be misused to produce inappropriate content. Existing safety measures, which typically rely on text classifiers or ControlNet-like approaches, are often insufficient. Traditional text classifiers rely on large-scale labeled datasets and can be easily bypassed by rephrasing. As diffusion models continue to scale, fine-tuning these safeguards becomes increasingly challenging and lacks flexibility. Recent red-teaming research further underscores the need for a new paradigm to prevent the generation of inappropriate content. In this paper, we introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model, ensuring that generated images adhere to ethical…

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 3

Strengths

The topic of this paper has practical significance.

Weaknesses

1. **The paper has poor writing and formatting quality.** The formula for Stable Diffusion in Sec. 2.1 is incorrect (the equation describes unconditional diffusion, not Stable Diffusion), and Fig. 2 is unclear.
2. Although the results presented in the experiments look promising, **the technical contribution is limited.** Similar approaches for guiding models to generate images toward specific targets have already been introduced in prior works [1,2].
3. **The experiments are insufficient.** It seems that
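
As background to the first weakness, the distinction the reviewer draws can be written out. This uses the standard noise-prediction objectives from the diffusion literature, not the equation in the paper's Sec. 2.1, which is not reproduced on this page. The symbols are assumptions for illustration: $\epsilon_\theta$ is the denoiser, $x_t$ a noised image, $z_t$ a noised latent, $y$ the text prompt, and $\tau_\theta$ the text encoder.

```latex
% Unconditional diffusion: the denoiser sees only the noised sample and the timestep.
\mathcal{L}_{\text{uncond}}
  = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}
    \left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|_2^2 \right]

% Text-conditioned latent diffusion (the Stable Diffusion setting):
% the denoiser works on latents z_t and also receives the encoded prompt \tau_\theta(y).
\mathcal{L}_{\text{cond}}
  = \mathbb{E}_{z_0,\ y,\ \epsilon \sim \mathcal{N}(0, I),\ t}
    \left[ \left\| \epsilon - \epsilon_\theta\!\left(z_t, t, \tau_\theta(y)\right) \right\|_2^2 \right]
```

The conditioning term $\tau_\theta(y)$ is what a text-embedding adaptor would act on, so an objective without it cannot describe where such a module intervenes.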

Reviewer 02 · Rating 3 · Confidence 4

Strengths

1. The SteerDiff method introduced in the paper does not require intensive computation and can effectively remove NSFW content and specific content.
2. Experiments demonstrate the effectiveness of the proposed method.

Weaknesses

1. This paper lacks clarity in its presentation, leading to confusion in both the figures and other descriptions. For example, W is introduced in Equation 3 but is not defined until Equation 4; the description of Figure 1 fails to illustrate the significance of the blue blocks; in Figure 2(a), the Text Encoder and Diffusion Model are placed in a trapezoidal frame, which can easily be misunderstood as the U-Net; Figure 2(b) does not differentiate between blue and red points; the text embeddings

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. The paper investigates an important and interesting problem: steering text-to-image diffusion models away from generating unsafe content.
2. The experimental results demonstrate the effectiveness of SteerDiff in filtering unsafe content while maintaining high image quality and accurately preserving the intended semantics of safe content.
3. The paper is well-written and easy to follow. It has good visualizations of the images generated by SteerDiff.

Weaknesses

1. SteerDiff is quite similar to Latent Guard [1], as both methods rely on LLMs to construct unsafe-safe prompt pairs and train models to detect unsafe concepts in the embedding space. While SteerDiff’s motivation is to steer unsafe concepts into safe regions, this raises the question: is transforming malicious prompts genuinely better than outright rejecting them? The steering module could potentially introduce vulnerabilities, as malicious users might exploit the transformation mechanism to by

Reviewer 04 · Rating 5 · Confidence 3

Strengths

- This paper proposes SteerDiff, a lightweight adapter module that serves as an intermediary between user input and the diffusion model. It effectively identifies and manipulates inappropriate concepts within the text embedding space, thereby guiding the model to generate images that comply with ethical and safety standards.
- Under various adversarial attacks, such as the white-box attack P4D and the black-box attack SneakyPrompt, SteerDiff demonstrates significant robustness, successfully red

Weaknesses

- The paper primarily focuses on unlearning harmful content. Have you considered attempting to unlearn style and copyright aspects as well? I noticed that ESD includes such experiments.
- The experiments were mainly conducted on SD1.4. Have you tried applying your approach to SDXL?
- The baselines compared in the paper are somewhat limited. There are other unlearning methods available, such as SPM [1], AdvUnlearn [2], and RECE [3], that could be included for a more comprehensive comparison.
- Additi

Taxonomy

Topics: Digital Media Forensic Detection · Advanced Steganography and Watermarking Techniques

Methods: Softmax · Attention Is All You Need · Diffusion