Disciplined Diffusion: Text-to-Image Diffusion Model against NSFW Generation
Chi Zhang, Changjia Zhu, Xiaowen Li, Yao Liu, Zhuo Lu

TL;DR
This paper introduces DDiffusion, a robust text-to-image diffusion model that detects and suppresses NSFW content by analyzing prompt semantics and locally editing images, improving safety without sacrificing image quality.
Contribution
It proposes a novel semantic retrieval and localization approach to enhance safety in diffusion models, reducing false alarms and vulnerability to adversarial prompts.
Findings
Effectively suppresses NSFW content while maintaining image quality.
Reduces false alarms compared to traditional binary filtering methods.
Enhances robustness against adversarial prompt modifications.
Abstract
Text-to-image (T2I) diffusion models can generate high-quality images from text prompts, but they raise safety concerns because they can produce offensive or disturbing imagery when given harmful inputs. Existing safety filters typically rely on text-based classifiers or image-based checkers that block the output entirely upon detecting a threat, returning an explicit allow/block signal to the user. This binary strategy leaves models vulnerable to adversarial attacks that alter keywords to bypass detection, and it causes high false-alarm rates that degrade the experience for benign users. To address these vulnerabilities, we propose Disciplined Diffusion (DDiffusion), a novel, robust text-to-image diffusion model that counters Not Safe For Work (NSFW) generation by uncovering implicit malicious semantics in prompt embeddings. DDiffusion leverages a semantic…
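The abstract's contrast between keyword-based blocking and checking semantics in embedding space can be illustrated with a toy sketch. This is not the authors' method: the three-dimensional vectors, the `BLOCKLIST`, the 0.9 threshold, and the function names `keyword_filter` and `semantic_flag` are all hypothetical, chosen only to show why an altered keyword can slip past exact matching while still landing near an unsafe concept in embedding space.

```python
import math

# Hypothetical hand-picked 3-d "embeddings" for illustration only.
# A real system would take these from a text encoder.
TOY_EMBEDDINGS = {
    "gore": (0.9, 0.1, 0.0),
    "g0re": (0.85, 0.15, 0.05),  # altered keyword, ASSUMED to embed close to "gore"
    "kitten": (0.0, 0.2, 0.9),
}

BLOCKLIST = {"gore"}                        # exact-match keyword filter
UNSAFE_CONCEPTS = [TOY_EMBEDDINGS["gore"]]  # reference vectors for unsafe concepts


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keyword_filter(tokens):
    """Binary filter: flags a prompt only if a token is literally on the blocklist."""
    return any(tok in BLOCKLIST for tok in tokens)


def semantic_flag(tokens, threshold=0.9):
    """Flags a prompt if any known token embeds close to an unsafe concept."""
    for tok in tokens:
        vec = TOY_EMBEDDINGS.get(tok)
        if vec is None:
            continue  # unknown token: this toy lookup has no vector for it
        if any(cosine(vec, concept) >= threshold for concept in UNSAFE_CONCEPTS):
            return True
    return False


if __name__ == "__main__":
    for prompt in (["gore"], ["g0re"], ["kitten"]):
        print(prompt, keyword_filter(prompt), semantic_flag(prompt))
```

In this toy setup the misspelled token evades the exact-match blocklist but is still flagged by the similarity check, while the benign token is flagged by neither.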
