SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
Komal Kumar, Ankan Deria, Abhishek Basu, Fahad Shamshad, Hisham Cholakkal, Karthik Nandakumar

TL;DR
This paper introduces SafeDiffusion-R1, an online reinforcement learning method that makes diffusion models safer by steering generations away from unsafe content, without requiring supervised unsafe data.
Contribution
It proposes a novel online reward steering mechanism using CLIP embeddings and Group Relative Policy Optimization to enhance safety without catastrophic forgetting.
Findings
Reduces the rate of inappropriate content from 48.9% (baseline) to 18.07%
Decreases nudity detections from 646 (baseline) to 15
Improves generative quality on GenEval from 42.08% to 47.83%
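The reported figures can be put on a common footing by expressing each as a change relative to its baseline. The short sketch below does only that arithmetic on the numbers listed above; the helper name `relative_change` is an illustrative choice, not something from the paper.

```python
def relative_change(before: float, after: float) -> float:
    """Signed change relative to the starting value (negative means a decrease)."""
    return (after - before) / before


# Figures as reported in the Findings list above.
inappropriate = relative_change(48.9, 18.07)   # inappropriate-content rate, in %
nudity = relative_change(646, 15)              # nudity detection count
geneval = relative_change(42.08, 47.83)        # GenEval score, in %

print(f"inappropriate content: {inappropriate:+.1%}")
print(f"nudity detections:     {nudity:+.1%}")
print(f"GenEval score:         {geneval:+.1%}")
```

In relative terms this works out to roughly a 63% drop in inappropriate content, a 98% drop in nudity detections, and a 14% gain on GenEval (5.75 percentage points in absolute terms).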
Abstract
Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe text paired with safe-image ground truth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, degrading generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: steering text representations…
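For readers unfamiliar with GRPO, its core idea is that each sampled output is scored relative to the other outputs generated for the same prompt, so no separate value network is needed. The following is a minimal sketch of that group-relative normalization only, not the authors' implementation. The reward values are hypothetical placeholders, the steering reward itself is not modeled here, and the use of the population standard deviation with a small `eps` is an assumption.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that score above the group average receive a positive advantage.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four images sampled from a single prompt.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Because advantages are centered within each group, a group whose samples all receive the same reward yields zero advantage for every sample and therefore contributes no learning signal in this formulation.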
