TL;DR
SafeText is an alignment method that fine-tunes the text encoder of a text-to-image model so that unsafe prompts no longer produce harmful images. It preserves image quality for safe prompts and outperforms six existing alignment methods.
Contribution
SafeText introduces a new approach by fine-tuning the text encoder instead of the diffusion module, effectively reducing harmful outputs while preserving image quality.
Findings
Effectively prevents harmful image generation from unsafe prompts.
Outperforms six existing alignment methods in safety and quality.
Minimal impact on image quality for safe prompts.
Abstract
Text-to-image models can generate harmful images when presented with unsafe prompts, posing significant safety and societal risks. Alignment methods aim to modify these models to ensure they generate only non-harmful images, even when exposed to unsafe prompts. A typical text-to-image model comprises two main components: 1) a text encoder and 2) a diffusion module. Existing alignment methods mainly focus on modifying the diffusion module to prevent harmful image generation. However, this often significantly impacts the model's behavior for safe prompts, causing substantial quality degradation of generated images. In this work, we propose SafeText, a novel alignment method that fine-tunes the text encoder rather than the diffusion module. By adjusting the text encoder, SafeText significantly alters the embedding vectors for unsafe prompts, while minimally affecting those for safe…
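To make the idea in the abstract concrete, the following is a minimal sketch of an objective that rewards moving unsafe-prompt embeddings away from their original values while penalizing drift in safe-prompt embeddings. It is not the paper's actual loss: the Euclidean distance, the weighting parameter `lam`, and all function names are assumptions chosen for illustration, and the embeddings are toy lists rather than real text-encoder outputs.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_distance(original, tuned):
    """Average distance between paired original and fine-tuned embeddings."""
    return sum(euclidean(o, t) for o, t in zip(original, tuned)) / len(original)


def alignment_objective(safe_orig, safe_tuned, unsafe_orig, unsafe_tuned, lam=1.0):
    """Hypothetical objective to minimize.

    utility term:       keep safe-prompt embeddings close to the originals
    effectiveness term: push unsafe-prompt embeddings away from the originals
    lam:                assumed trade-off weight between the two terms
    """
    utility_loss = mean_distance(safe_orig, safe_tuned)
    effectiveness_gain = mean_distance(unsafe_orig, unsafe_tuned)
    return utility_loss - lam * effectiveness_gain


# Toy embeddings: the safe prompt is unchanged, the unsafe prompt has moved.
safe_orig, safe_tuned = [[1.0, 0.0]], [[1.0, 0.0]]
unsafe_orig, unsafe_tuned = [[0.0, 1.0]], [[3.0, 5.0]]

score = alignment_objective(safe_orig, safe_tuned, unsafe_orig, unsafe_tuned)
print(score)  # -5.0
```

A lower score means the fine-tuned encoder changes unsafe-prompt embeddings more and safe-prompt embeddings less. As one reviewer notes, a large embedding shift by itself does not guarantee that the resulting image is safe, so a sketch like this captures only the embedding-level intuition.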
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The problem of harmful content generation in text-to-image models is both timely and important, especially as these models become more accessible and integrated into consumer applications. The authors rightly identify the trade-off between safety and utility as a key challenge in model alignment. 2. Most prior alignment approaches focus on modifying the diffusion module, which often degrades image quality for safe prompts. SafeText instead modifies the text encoder, which is a novel angle an
1. The paper argues that modifying the diffusion module degrades image quality for safe prompts, but it does not substantiate this claim with sufficient quantitative evidence. The authors should provide direct comparisons of image quality degradation across methods, showing how much more the diffusion-module-based methods affect safe prompts than SafeText does. Without such comparisons, the core motivation of the method is weakened. 2. The paper equates "safety"
1. The paper demonstrates a practical advantage of aligning the text encoder over the diffusion module. The extensive experimental results consistently show that the proposed method achieves a superior trade-off, effectively suppressing harmful content while better preserving image quality for safe prompts compared to existing baselines. 2. The work includes a well-designed ablation study. The analysis of different distance metrics and the controlled experiment investigating the impact of em
1. The designed optimization objective implies that altering the text embedding of an unsafe prompt will lead to a safe image. However, it only ensures that the image generated from the altered unsafe-prompt embedding differs from the original unsafe one; it does not guarantee that the altered embedding maps to a safe concept. 2. The safety evaluation relies on NudeNet for detecting sexually explicit content. This is a narrow definition of "harmful
1. SafeText only aligns the text encoder, resulting in minimal changes to embeddings of safe prompts while substantially altering those of unsafe ones. 2. SafeText achieves high effectiveness while maintaining strong utility, outperforming six baseline methods on Stable Diffusion v1.4. 3. The method is effective against both manually crafted unsafe prompts and adversarially crafted ones generated by state-of-the-art jailbreak techniques. 4. The authors commit to releasing code and data upon pape
1. The main evaluation focuses on nude/sexually explicit content, with only preliminary verification on violent content. Other unsafe concepts (e.g., hate speech-related images, depictions of dangerous behaviors) lack systematic testing. 2. The method is only tested on Stable Diffusion v1.4; results may not generalize to newer or larger-scale models with distinct text encoders or diffusion module architectures. 3. There is no discussion of computational costs or fine-tuning efficiency in compari
- Important topic - Well written and well structured - Strong performance
- Unclear text encoder - Potential vulnerability to more advanced attacks - Trade-off between NRR and LPIPS - Lack of efficiency comparison
Taxonomy
Methods: Diffusion · Focus
