Diffusion Guidance Is a Controllable Policy Improvement Operator
Kevin Frans, Seohong Park, Pieter Abbeel, Sergey Levine

TL;DR
This paper introduces CFGRL, a policy improvement method based on diffusion guidance. It is trained with the simplicity of supervised learning, can improve on the policies in offline RL data, and can operate without explicitly learning a value function.
Contribution
The paper derives a direct relation between policy improvement and the guidance of diffusion models. The resulting framework, CFGRL, keeps training as simple as supervised learning while improving performance on offline RL tasks.
Findings
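On offline RL tasks, increased guidance weighting leads to increased performance.
CFGRL can operate without explicitly learning a value function.
CFGRL generalizes simple supervised methods (e.g., goal-conditioned behavioral cloning) to further prioritize optimality, yielding performance gains.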
Abstract
At the core of reinforcement learning is the idea of learning beyond the performance in the data. However, scaling such systems has proven notoriously tricky. In contrast, techniques from generative modeling have proven remarkably scalable and are simple to train. In this work, we combine these strengths, by deriving a direct relation between policy improvement and guidance of diffusion models. The resulting framework, CFGRL, is trained with the simplicity of supervised learning, yet can further improve on the policies in the data. On offline RL tasks, we observe a reliable trend -- increased guidance weighting leads to increased performance. Of particular importance, CFGRL can operate without explicitly learning a value function, allowing us to generalize simple supervised methods (e.g., goal-conditioned behavioral cloning) to further prioritize optimality, gaining performance for…
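To make the role of the guidance weight concrete, here is a minimal toy sketch of the standard classifier-free guidance combination. It is not the authors' implementation. It assumes one-dimensional Gaussian policies with a shared standard deviation, so both scores are analytic; the function names (`gaussian_score`, `guided_score`, `ascend`) and all numeric values are hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_score(a, mean, std):
    """Score (gradient of the log-density w.r.t. a) of a 1-D Gaussian."""
    return -(a - mean) / std**2


def guided_score(a, w, mean_uncond, mean_cond, std):
    """Standard classifier-free guidance combination of two scores.

    w = 0 recovers the unconditional (dataset) policy, w = 1 the
    conditioned policy, and w > 1 extrapolates beyond it.
    """
    s_uncond = gaussian_score(a, mean_uncond, std)
    s_cond = gaussian_score(a, mean_cond, std)
    return s_uncond + w * (s_cond - s_uncond)


def ascend(w, mean_uncond=0.0, mean_cond=1.0, std=0.5, lr=0.1, steps=200):
    """Follow the guided score from a = 0 to locate the guided mode."""
    a = 0.0
    for _ in range(steps):
        a += lr * guided_score(a, w, mean_uncond, mean_cond, std)
    return a


if __name__ == "__main__":
    for w in (0.0, 1.0, 3.0):
        print(f"w={w}: guided mode ~ {ascend(w):.4f}")
```

In this toy setting the guided mode sits at `mean_uncond + w * (mean_cond - mean_uncond)`, so a larger weight moves the sampled action further from the dataset policy in the direction of the conditioned one. Real policies are neither Gaussian nor one-dimensional, so this only illustrates the mechanism that the weight controls.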
Peer Reviews
Decision · Submitted to ICLR 2026
The central insight—viewing policy improvement as classifier-free guidance over an advantage-conditioned policy—is elegant. It unifies guided diffusion sampling with KL-regularized policy improvement and control-as-inference via a clean product-policy view, and shows that test-time guidance directly tunes the improvement strength. The theory is tidy. The paper also avoids learning an explicit optimality predictor via a Bayes inversion that merges unconditional and optimality-conditioned polici
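As a reading aid for the product-policy view and the Bayes inversion mentioned in this review, the relation can be sketched as follows. The notation is assumed for illustration and may differ from the paper's: $\hat\pi$ is the dataset policy, $o$ is an optimality variable, and $w$ is the guidance weight.

$$
\pi_w(a \mid s) \;\propto\; \hat\pi(a \mid s)\, p(o \mid s, a)^{w},
\qquad
p(o \mid s, a) \;=\; \frac{\hat\pi(a \mid s, o)\, p(o \mid s)}{\hat\pi(a \mid s)} .
$$

Because $p(o \mid s)$ does not depend on $a$, taking the gradient of the log with respect to $a$ gives

$$
\nabla_a \log \pi_w(a \mid s)
\;=\;
\nabla_a \log \hat\pi(a \mid s)
\;+\;
w \left( \nabla_a \log \hat\pi(a \mid s, o) - \nabla_a \log \hat\pi(a \mid s) \right),
$$

which has the form of classifier-free guidance: only an unconditional and an optimality-conditioned policy are needed, and no separate predictor of $o$.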
The paper notes that larger $w$ both improves $A_{\hat\pi}$ and deviates more from the dataset policy, possibly hurting performance; the ablation indeed shows performance sometimes declines past a point, but there’s no adaptive or trust-region control of $w$ or measured KL to the prior. For the offline RL part, results are averaged over four seeds; gains are consistent but sometimes modest. The GCBC part uses more seeds, but a wider set of domains and stronger end-to-end RL baselines would f
1. Simplicity and practical appeal – The method requires only standard diffusion training and allows tuning the improvement strength $w$ at inference, offering a practical way to control policy quality without retraining.
2. Solid empirical demonstration – Results on offline RL and goal-conditioned control tasks consistently show improvements over strong baselines such as AWR and GCBC.
3. Readable and well-presented – The paper is clearly written, with theoretical and empirical sections well b
1. Limited novelty beyond reinterpretation. The core idea, recasting classifier-free guidance as a policy improvement operator, is conceptually elegant but incremental. The method mainly replaces the continuous classifier (score function) in diffusion guidance with a discrete optimality variable, which is a small modification rather than a fundamentally new algorithmic contribution. Much of the theoretical framing follows directly from existing formulations of advantage-weighted regression and c
1. The paper is well written, and the presentation of results is clear and easy for readers to follow.
2. To the best of my knowledge, this paper is the first to theoretically establish and prove the connection between classifier-free guided diffusion policy sampling and the policy improvement operator in RL.
3. The authors’ analysis of AWR’s weakness in Section 5, together with the experimental observation that CFGRL can sustain larger guidance weights than AWR, constitutes an interesting res
1. The main limitation of this paper lies in that most of its ideas have already appeared independently in prior works. For example, the relationship between classifier-free guidance and weighted regression has been discussed in [1], while the use of classifier-free guidance for policy improvement and the adjustment of different guidance weights was explored in [2]. Although the authors argue that [2] focuses on generating future state sequences whereas CFGRL generates single-step actions, I con
Taxonomy
Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Domain Adaptation and Few-Shot Learning
Methods: Diffusion
