Wukong Framework for Not Safe For Work Detection in Text-to-Image systems
Mingrui Liu, Sixiao Zhang, Cheng Long

TL;DR
Wukong is a transformer-based framework integrated into diffusion models for early, efficient, and accurate NSFW detection in text-to-image generation, leveraging intermediate denoising outputs and cross-attention features.
Contribution
The paper introduces Wukong, a novel method that detects NSFW content during the diffusion process by utilizing intermediate outputs and shared attention parameters, improving efficiency and accuracy.
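To make the idea of detecting during the diffusion process concrete, here is a minimal control-flow sketch. It is not the authors' implementation. The function names, the single check after a fixed number of steps, the threshold, and the toy denoiser and scorer are all assumptions introduced for illustration. In the paper's setting the scorer would be a learned transformer operating on intermediate denoising outputs and cross-attention features.

```python
from typing import Callable, List, Tuple

Latent = List[float]  # toy stand-in for an intermediate denoising output


def generate_with_early_check(
    latent: Latent,
    denoise_step: Callable[[Latent, int], Latent],  # hypothetical denoiser
    nsfw_score: Callable[[Latent], float],          # hypothetical detector
    num_steps: int = 50,
    check_step: int = 5,     # assumed: inspect once, early in the schedule
    threshold: float = 0.5,  # assumed decision threshold
) -> Tuple[Latent, bool, int]:
    """Run a toy denoising loop and stop early if the detector fires.

    Returns (latent, flagged, steps_run).
    """
    for step in range(1, num_steps + 1):
        latent = denoise_step(latent, step)
        # Score the intermediate output instead of waiting for the final image.
        if step == check_step and nsfw_score(latent) >= threshold:
            return latent, True, step
    return latent, False, num_steps


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_denoise(latent: Latent, step: int) -> Latent:
    return [0.9 * x for x in latent]


def toy_score(latent: Latent) -> float:
    return 1.0 if sum(latent) > 1.0 else 0.0


flagged_run = generate_with_early_check([1.0, 1.0], toy_denoise, toy_score)
clean_run = generate_with_early_check([0.1, 0.1], toy_denoise, toy_score)
```

In this toy setup a flagged request stops after `check_step` steps rather than running the full schedule, which is where the latency saving over a filter on the final image would come from.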
Findings
Wukong outperforms text-based safeguards in accuracy.
Wukong achieves results comparable to those of image filters.
Wukong enables early NSFW detection during image generation.
Abstract
Text-to-Image (T2I) generation is a popular AI-generated content (AIGC) technology enabling diverse and creative image synthesis. However, some outputs may contain Not Safe For Work (NSFW) content (e.g., violence), violating community guidelines. Detecting NSFW content efficiently and accurately, known as external safeguarding, is essential. Existing external safeguards fall into two types: text filters, which analyze user prompts but overlook T2I model-specific variations and are prone to adversarial attacks; and image filters, which analyze final generated images but are computationally costly and introduce latency. Diffusion models, the foundation of modern T2I systems like Stable Diffusion, generate images through iterative denoising using a U-Net architecture with ResNet and Transformer blocks. We observe that: (1) early denoising steps define the semantic layout of the image, and…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Hate Speech and Cyberbullying Detection
