ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning

Dong Han; Salaheldin Mohamed; Yong Li

arXiv:2410.05309·cs.CV·October 10, 2024

Open Access

TL;DR

This paper introduces ShieldDiff, a reinforcement learning-based fine-tuning approach for diffusion models that effectively suppresses unsafe sexual content generation while maintaining high image quality and semantic relevance.

Contribution

The paper presents a novel reinforcement learning method with a custom reward function to reduce NSFW content in diffusion models without sacrificing image fidelity.

Findings

01 Effective reduction of unsafe content generation.

02 Maintains high image quality and semantic relevance.

03 Outperforms state-of-the-art methods in robustness against adversarial prompts.

Abstract

With the advance of generative AI, text-to-image (T2I) models can generate a wide variety of content. However, the generated content cannot be fully controlled, and there is a risk that a T2I model will generate unsafe images with uncomfortable content. In our work, we focus on eliminating NSFW (not safe for work) content generation from the T2I model while maintaining the high quality of generated images. We do so by fine-tuning the pre-trained diffusion model via reinforcement learning, optimizing a content-safe reward function. The proposed method leverages a customized reward function consisting of CLIP (Contrastive Language-Image Pre-training) and nudity rewards to prune the nudity content that adheres to the pre-trained model while keeping the corresponding semantic meaning on the safe side. In this way, the T2I model is robust to unsafe adversarial prompts…
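The abstract describes a reward built from a CLIP term and a nudity term but does not give its exact form. The following is a minimal sketch of one way such a reward could be combined, assuming a weighted sum; the function name `content_safe_reward`, the `RewardWeights` class, the weight values, and the assumption that both inputs are already normalized to [0, 1] are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical trade-off weights; the paper's actual values are not given here."""
    clip: float = 1.0    # weight on the semantic-alignment (CLIP) term
    nudity: float = 1.0  # weight on the safety (nudity) term


def content_safe_reward(clip_score: float,
                        nudity_prob: float,
                        weights: RewardWeights = RewardWeights()) -> float:
    """Combine a CLIP alignment score with a nudity-detector output.

    clip_score:  image-text similarity, assumed to be scaled to [0, 1].
    nudity_prob: assumed probability in [0, 1] that the image contains nudity,
                 as produced by some external detector (not implemented here).
    """
    if not 0.0 <= nudity_prob <= 1.0:
        raise ValueError("nudity_prob must lie in [0, 1]")
    # Assumption: the nudity reward is the complement of the detector
    # probability, so safer images earn a higher reward.
    nudity_reward = 1.0 - nudity_prob
    return weights.clip * clip_score + weights.nudity * nudity_reward


if __name__ == "__main__":
    # Same semantic alignment, different detector outputs.
    print(content_safe_reward(clip_score=0.8, nudity_prob=0.1))
    print(content_safe_reward(clip_score=0.8, nudity_prob=0.9))
```

Under this sketch, two images with equal CLIP alignment are ranked by their detector output, so a reinforcement learning fine-tuning loop that maximizes the reward would be pushed toward images that stay on-prompt while scoring low on nudity. The relative weights control how much semantic fidelity is traded for safety.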

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling

Methods: Contrastive Language-Image Pre-training · Focus · Diffusion