Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion
Mykola Vysotskyi, Zahar Kohut, Mariia Shpir, Taras Rumezhak, Volodymyr Karpiv

TL;DR
This paper introduces a reinforcement learning framework for unlearning specific concepts in text-to-image diffusion models, improving stability and effectiveness over prior methods by using a timestep-aware critic and noisy-step rewards.
Contribution
It proposes a novel RL-based diffusion unlearning approach with a timestep-aware critic and noisy rewards, supporting off-policy reuse and better concept forgetting.
Findings
Achieves comparable or better forgetting than strong baselines.
Maintains image quality and prompt fidelity.
Key components include per-step critics and noisy-conditioned rewards.
Abstract
Machine unlearning in text-to-image diffusion models aims to remove targeted concepts while preserving overall utility. Prior diffusion unlearning methods typically rely on supervised weight edits or global penalties; reinforcement-learning (RL) approaches, while flexible, often optimize sparse end-of-trajectory rewards, yielding high-variance updates and weak credit assignment. We present a general RL framework for diffusion unlearning that treats denoising as a sequential decision process and introduces a timestep-aware critic with noisy-step rewards. Concretely, we train a CLIP-based reward predictor on noisy latents and use its per-step signal to compute advantage estimates for policy-gradient updates of the reverse diffusion kernel. Our algorithm is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones. Across multiple concepts, the method…
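The per-step credit assignment described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state which advantage estimator is used, so generalized advantage estimation (GAE) is assumed here. The function names, the `gamma` and `lam` parameters, and the isotropic Gaussian form of the reverse kernel are all hypothetical choices made for illustration. Index 0 denotes the first (noisiest) denoising step.

```python
import math


def per_step_advantages(rewards, values, gamma=1.0, lam=0.95):
    """GAE over one denoising trajectory (assumed estimator, not from the paper).

    rewards: list of length T, a reward for each denoising step
             (e.g. from a reward predictor evaluated on the noisy latent).
    values:  list of length T + 1, critic estimates V(x_t, t) for each step,
             plus a bootstrap value after the final step (0.0 if finished).
    """
    T = len(rewards)
    if len(values) != T + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * T
    running = 0.0
    # Walk backwards so each step accumulates discounted future TD errors.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def gaussian_log_prob(x_next, mean, sigma):
    """Log-density of an isotropic Gaussian step N(mean, sigma^2 I)."""
    d = len(x_next)
    sq = sum((a - b) ** 2 for a, b in zip(x_next, mean))
    return -0.5 * sq / sigma**2 - d * math.log(sigma) - 0.5 * d * math.log(2 * math.pi)


def policy_gradient_loss(log_probs, advantages):
    """Surrogate loss: minimizing it raises log-prob of high-advantage steps."""
    return -sum(lp * a for lp, a in zip(log_probs, advantages)) / len(log_probs)


if __name__ == "__main__":
    adv = per_step_advantages([1.0, 0.0, 2.0], [0.5, 0.5, 0.5, 0.0], lam=1.0)
    print(adv)
```

With `lam=1.0` and `gamma=1.0` the advantages reduce to reward-to-go minus the critic value at each step. A sparse end-of-trajectory reward corresponds to a `rewards` list that is zero everywhere except the last entry, which is the setting the abstract describes as high-variance.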
Peer Reviews
Decision: Submitted to ICLR 2026
For many legal and safety reasons, reliable unlearning techniques for diffusion models are important, and the current state of the art is far from perfect, so exploring new techniques is significant for the progress of the field. I'm not very familiar with the most recent related work in this field, but training value-model baselines for diffusion-model RL appears to be novel (though it is a very straightforward application of a common RL technique). The method is straightforward, makes sense, and is
Largely, more experimental results would contribute significantly to the point of the paper.
- RL training is notoriously unstable, so having at least 3 seeds with error bars in Figures 1 and 2 would make me more confident the performance improvement is not just luck.
- Include more non-cherry-picked image grids of generated image examples for the different methods.
(These two are my largest critiques; my score would likely rise if these were addressed.)
- In Table 1 I recommend running a few
1. The timestep-aware critic addresses the high-variance problem of sparse rewards in prior RL-for-diffusion methods, leading to more stable training and better credit assignment.
2. The method achieves state-of-the-art unlearning accuracy on object-removal tasks.
1. Although the paper targets machine unlearning, I didn’t see any design specific to machine unlearning. The proposed critic seems to be the same as the value function used for variance reduction in standard policy-gradient methods. As a result, it reads more like an RL-for-diffusion method applied to a specific domain.
2. The value function is also used in methods like DPOK, and the proposed method just seems to be more fine-grained in that it also depends on the timestep. But there’s
- The paper is well motivated and coherent, with clean notation and algorithmic details.
- The paper offers a new perspective on unlearning by reframing diffusion sampling as an actor-critic RL problem, and provides a formal connection between two active areas (diffusion alignment and machine unlearning) in a unified formalism.
- Introduction of a per-timestep critic for diffusion policy optimization is a clear algorithmic step forward, especially given the substantial instability and high
- Scope of evaluations: The evaluation depth is limited. While the paper tests 20 object classes, it does not cover a single concept, keeping the scope narrow. The experiments focus only on one model (Stable Diffusion 1.5) and one dataset (UnlearnCanvas), concentrating on specific objects like "Cats" and "Towers" (Appendix D, Table 4). There are no results on more abstract or safety-critical tasks like style removal or identity erasure, which are mentioned as important reasons for unlearning in
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques · Domain Adaptation and Few-Shot Learning
