NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation
Yu Xie, Chengjie Zeng, Lingyun Zhang, Yanwei Fu

TL;DR
This paper introduces PromptSan, a novel prompt sanitization method for text-to-image models that effectively reduces harmful content generation without compromising image quality or model performance.
Contribution
The paper proposes two prompt sanitization variants, PromptSan-Modify and PromptSan-Suffix, which detoxify prompts to make T2I models safer.
Findings
PromptSan significantly reduces harmful content in generated images.
The methods maintain high image quality and model usability.
PromptSan outperforms existing safety mitigation approaches.
Abstract
The rapid advancement of text-to-image (T2I) models, such as Stable Diffusion, has enhanced their capability to synthesize images from textual prompts. However, this progress also raises significant risks of misuse, including the generation of harmful content (e.g., pornography, violence, discrimination), which contradicts the ethical goals of T2I technology and hinders its sustainable development. Inspired by "jailbreak" attacks in large language models, which bypass restrictions through subtle prompt modifications, this paper proposes NSFW-Classifier Guided Prompt Sanitization (PromptSan), a novel approach to detoxify harmful prompts without altering model architecture or degrading generation capability. PromptSan includes two variants: PromptSan-Modify, which iteratively identifies and replaces harmful tokens in input prompts using text NSFW classifiers during inference, and…
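As a rough illustration of the PromptSan-Modify idea described above (repeatedly find the most harmful token and replace it until a text NSFW classifier no longer flags the prompt), here is a minimal Python sketch. It is not the authors' implementation. The keyword-based `toy_nsfw_score`, the `BLOCKLIST` values, the 0.5 threshold, the per-token attribution rule, and the neutral replacement word are all assumptions made for the example. A real system would use a learned classifier and a more careful way of choosing replacements.

```python
from typing import Callable, List

# Hypothetical stand-in for a learned text NSFW classifier.
# Maps a few example words to made-up harm scores in [0, 1].
BLOCKLIST = {"gore": 0.9, "naked": 0.8, "bloody": 0.7}


def toy_nsfw_score(tokens: List[str]) -> float:
    """Return the highest per-word harm score in the token list (0.0 if none)."""
    return max((BLOCKLIST.get(t.lower(), 0.0) for t in tokens), default=0.0)


def sanitize_prompt(
    prompt: str,
    score_fn: Callable[[List[str]], float] = toy_nsfw_score,
    threshold: float = 0.5,   # assumed decision threshold
    neutral: str = "calm",    # assumed neutral replacement word
    max_iters: int = 10,
) -> str:
    """Iteratively replace the most harmful token until the prompt scores below threshold."""
    tokens = prompt.split()
    for _ in range(max_iters):
        if not tokens or score_fn(tokens) < threshold:
            break
        # Simplified attribution: score each token on its own and pick the worst one.
        worst = max(range(len(tokens)), key=lambda i: score_fn([tokens[i]]))
        tokens[worst] = neutral
    return " ".join(tokens)


if __name__ == "__main__":
    print(sanitize_prompt("a bloody battlefield"))
    print(sanitize_prompt("a cat on a sofa"))
```

Scoring each token in isolation keeps the sketch short, but it ignores context. With a real classifier, the harmful token could instead be located by gradient-based or leave-one-out attribution over the full prompt.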
