Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion

Mykola Vysotskyi; Zahar Kohut; Mariia Shpir; Taras Rumezhak; Volodymyr Karpiv

arXiv:2601.03213·cs.LG·February 17, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a reinforcement learning framework for unlearning specific concepts in text-to-image diffusion models, improving stability and effectiveness over prior methods by using a timestep-aware critic and noisy-step rewards.

Contribution

It proposes a novel RL-based diffusion unlearning approach with a timestep-aware critic and noisy rewards, supporting off-policy reuse and better concept forgetting.

Findings

1. Achieves comparable or better forgetting than strong baselines.
2. Maintains image quality and prompt fidelity.
3. Key components include per-step critics and noisy-conditioned rewards.

Abstract

Machine unlearning in text-to-image diffusion models aims to remove targeted concepts while preserving overall utility. Prior diffusion unlearning methods typically rely on supervised weight edits or global penalties; reinforcement-learning (RL) approaches, while flexible, often optimize sparse end-of-trajectory rewards, yielding high-variance updates and weak credit assignment. We present a general RL framework for diffusion unlearning that treats denoising as a sequential decision process and introduces a timestep-aware critic with noisy-step rewards. Concretely, we train a CLIP-based reward predictor on noisy latents and use its per-step signal to compute advantage estimates for policy-gradient updates of the reverse diffusion kernel. Our algorithm is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones. Across multiple concepts, the method…
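The abstract's description of per-step advantages can be illustrated with a small sketch. The abstract does not give the exact estimator, so the form below is an assumption rather than the authors' method: each denoising step is credited with the change in the critic's predicted reward, and those advantages weight a REINFORCE-style surrogate over the Gaussian reverse-kernel log-probabilities. All function names and the toy values are hypothetical.

```python
import math

def step_advantages(critic_values):
    """Per-transition advantages from critic predictions (assumed form).

    critic_values[i] is the critic's predicted reward at the i-th denoising
    step (index 0 = noisiest latent, last = final image). Each advantage is
    the change in that prediction caused by one denoising step.
    """
    return [nxt - cur for cur, nxt in zip(critic_values, critic_values[1:])]

def gaussian_log_prob(x, mean, std):
    """Log-density of a scalar under N(mean, std^2)."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def surrogate_loss(log_probs, advantages):
    """REINFORCE-style surrogate loss.

    Minimizing it raises the log-probability of steps with positive
    advantage and lowers it for steps with negative advantage.
    """
    assert len(log_probs) == len(advantages)
    return -sum(lp * a for lp, a in zip(log_probs, advantages)) / len(log_probs)

# Toy trajectory: hypothetical critic predictions at four noise levels.
values = [0.10, 0.25, 0.20, 0.60]
adv = step_advantages(values)
```

With this choice the advantages telescope: their sum equals the last critic value minus the first, so the dense per-step signal redistributes the same total credit that a single end-of-trajectory reward would assign. For unlearning, the reward would be defined so that less of the target concept scores higher.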

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 2

Strengths

For many legal and safety reasons, managing proper unlearning techniques through diffusion models is important, and currently not in a perfect state. So exploring new techniques is significant for the progress of the field. I'm not very familiar with the most recent related work in this field, but it sounds like training value model baselines for diffusion model RL is novel (though a very straightforward application of a common RL technique). The method is straightforward, makes sense, and is

Weaknesses

Largely, more experimental results would contribute significantly to the point of the paper.
- RL training is notoriously unstable, so having at least 3 seeds with error bars in Figures 1 and 2 would make me more confident the performance improvement is not just luck.
- Include more non-cherry-picked image grids of generated image examples for the different methods. (These two are my largest critiques; my score would likely rise if these were addressed.)
- In Table 1 I recommend running a few

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The timestep-aware critic addresses the high-variance problem of sparse rewards in prior RL-for-diffusion methods, leading to more stable training and better credit assignment.
2. The method achieves state-of-the-art unlearning accuracy on object-removal tasks.

Weaknesses

1. As the target of this paper is machine unlearning, I didn’t see any specific designs for machine unlearning. The proposed critic seems to be the same as the value function used for variance reduction in normal policy gradient methods. As a result, it is more like an RL-for-diffusion method applied to a specific domain.
2. The value function is also used in methods like DPOK, and the proposed method just seems to be more fine-grained in that it also depends on the timestep. But there’s

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The paper is well motivated and coherent, with clean notation and algorithmic details.
- The paper offers a new perspective on unlearning by reframing diffusion sampling as an actor-critic RL problem, and provides a formal connection between two active areas (diffusion alignment and machine unlearning) in a unified formalism.
- Introduction of a per-timestep critic for diffusion policy optimization is a clear algorithmic step forward, especially given the substantial instability and high

Weaknesses

- Scope of evaluations: The evaluation depth is limited. While the paper tests 20 object classes, it does not cover a single concept, keeping the scope narrow. The experiments focus only on one model (Stable Diffusion 1.5) and one dataset (UnlearnCanvas), concentrating on specific objects like "Cats" and "Towers" (Appendix D, Table 4). There are no results on more abstract or safety-critical tasks like style removal or identity erasure, which are mentioned as important reasons for unlearning in

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques · Domain Adaptation and Few-Shot Learning