SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

Komal Kumar; Ankan Deria; Abhishek Basu; Fahad Shamshad; Hisham Cholakkal; Karthik Nandakumar

arXiv:2605.18719·cs.CV·May 19, 2026

TL;DR

This paper introduces SafeDiffusion-R1, an online reinforcement learning method that improves diffusion model safety by steering generations away from unsafe content, without requiring supervised safe/unsafe training data.

Contribution

It proposes a novel online reward steering mechanism using CLIP embeddings and Group Relative Policy Optimization to enhance safety without catastrophic forgetting.
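To make the idea of an embedding-based reward concrete, here is a minimal sketch of a contrastive cosine-similarity reward. It is a generic illustration, not the paper's formulation: the function names, the "safe minus unsafe similarity" form, and the toy 3-dimensional vectors (standing in for real CLIP embeddings) are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_reward(image_emb, safe_text_emb, unsafe_text_emb):
    """Hypothetical reward: higher when the image embedding is closer to the
    safe text embedding than to the unsafe one. Assumed form for illustration."""
    return cosine(image_emb, safe_text_emb) - cosine(image_emb, unsafe_text_emb)


# Toy vectors standing in for real CLIP embeddings (assumption).
safe_text = [1.0, 0.0, 0.0]
unsafe_text = [0.0, 1.0, 0.0]
image_near_safe = [0.9, 0.1, 0.0]
image_near_unsafe = [0.1, 0.9, 0.0]

reward_safe = contrastive_reward(image_near_safe, safe_text, unsafe_text)
reward_unsafe = contrastive_reward(image_near_unsafe, safe_text, unsafe_text)
print(f"reward (near safe):   {reward_safe:+.3f}")
print(f"reward (near unsafe): {reward_unsafe:+.3f}")
```

A reward of this shape needs no separately fine-tuned safety classifier, only embeddings from a shared text-image space; the paper's actual steering reward may differ in form.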

Findings

01  Reduces the rate of inappropriate content from 48.9% (baseline) to 18.07%

02  Reduces nudity detections from 646 (baseline) to 15

03  Improves generative quality on GenEval from 42.08% to 47.83%
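The three findings mix percentages, raw counts, and benchmark scores, so it can help to put the safety numbers on a common relative scale. The short sketch below does only that arithmetic on the figures quoted above; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before, after):
    """Percentage reduction of `after` relative to `before`."""
    return (before - after) / before * 100.0


# Figures as quoted in the findings above.
inappropriate_drop = relative_reduction(48.9, 18.07)  # inappropriate-content rate
nudity_drop = relative_reduction(646, 15)             # nudity detections
geneval_gain_points = 47.83 - 42.08                   # absolute gain on GenEval

print(f"Inappropriate content: {inappropriate_drop:.1f}% relative reduction")
print(f"Nudity detections:     {nudity_drop:.1f}% relative reduction")
print(f"GenEval:               +{geneval_gain_points:.2f} percentage points")
```

In relative terms, the inappropriate-content rate falls by roughly 63% and nudity detections by roughly 98%, while the GenEval score rises by 5.75 percentage points.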

Abstract

Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe text paired with safe-image ground truth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, which degrades generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: steering text representations…
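The abstract names GRPO as the optimizer. As commonly described, GRPO scores each sample against the other samples drawn for the same prompt rather than against a learned value function. The sketch below shows only that group-relative normalization step; the function name, the use of population standard deviation, the `eps` value, and the example rewards are assumptions for illustration and are not taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four images generated from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Samples that score above their group's mean receive positive advantages and are reinforced, while below-average samples are pushed down. A group in which every sample gets the same reward yields all-zero advantages and therefore contributes no learning signal.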

Code & Models

Repositories

MAXNORM8650/SafeDiffusion-R1 (GitHub)

Models

ItsMaxNorm/SafeDiffusion-R1 (Hugging Face model)

Datasets

ItsMaxNorm/SafeDiffusion-R1-dataset (dataset, 100 downloads)
