ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning
Dong Han, Salaheldin Mohamed, Yong Li

TL;DR
This paper introduces ShieldDiff, a reinforcement learning-based fine-tuning approach for diffusion models that effectively suppresses unsafe sexual content generation while maintaining high image quality and semantic relevance.
Contribution
The paper presents a novel reinforcement learning method with a custom reward function to reduce NSFW content in diffusion models without sacrificing image fidelity.
Findings
Effective reduction of unsafe content generation.
Maintains high image quality and semantic relevance.
Outperforms state-of-the-art methods in robustness against adversarial prompts.
Abstract
With the advance of generative AI, text-to-image (T2I) models can generate a wide variety of content. However, the generated content cannot be fully controlled, and there is a risk that a T2I model will generate unsafe images with uncomfortable content. In our work, we focus on eliminating NSFW (not safe for work) content generation from T2I models while maintaining the high quality of the generated images. We do so by fine-tuning the pre-trained diffusion model via reinforcement learning, optimizing a well-designed content-safe reward function. The proposed method leverages a customized reward function consisting of CLIP (Contrastive Language-Image Pre-training) and nudity rewards to prune the nudity content that adheres to the pre-trained model while keeping the corresponding semantic meaning on the safe side. In this way, the T2I model is robust to unsafe adversarial prompts…
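The abstract describes a reward that combines a CLIP term with a nudity term, but this page does not give the exact formula. As a rough illustration only, the sketch below assumes a simple weighted difference: a CLIP image-text similarity score is rewarded and a nudity-detector probability is penalized. The names `RewardWeights` and `content_safe_reward`, the weights, and the linear form are all hypothetical and are not taken from the paper; the two input scores are assumed to come from external models that are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical trade-off weights; the paper's actual values are not given here."""
    clip: float = 1.0    # weight on semantic alignment with the prompt
    nudity: float = 1.0  # weight on the safety penalty


def content_safe_reward(
    clip_similarity: float,
    nudity_probability: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Combine a CLIP score and a nudity score into one scalar reward.

    clip_similarity: image-text similarity from a CLIP model (assumed input).
    nudity_probability: output of a nudity classifier in [0, 1] (assumed input).
    The linear combination is an assumption made for illustration.
    """
    if not 0.0 <= nudity_probability <= 1.0:
        raise ValueError("nudity_probability must lie in [0, 1]")
    # Higher similarity raises the reward; higher nudity probability lowers it.
    return weights.clip * clip_similarity - weights.nudity * nudity_probability


if __name__ == "__main__":
    # Two images with the same prompt alignment but different safety scores.
    safe = content_safe_reward(clip_similarity=0.30, nudity_probability=0.02)
    unsafe = content_safe_reward(clip_similarity=0.30, nudity_probability=0.90)
    print(f"safe image reward:   {safe:.2f}")
    print(f"unsafe image reward: {unsafe:.2f}")
```

Under this assumed form, an image that matches the prompt but triggers the nudity detector scores lower than an equally well-aligned safe image, which is the kind of signal a reinforcement learning fine-tuning step could use to steer the model away from unsafe outputs.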
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling
Methods: Contrastive Language-Image Pre-training · Focus · Diffusion
