Diffusion Guidance Is a Controllable Policy Improvement Operator

Kevin Frans; Seohong Park; Pieter Abbeel; Sergey Levine

arXiv:2505.23458·cs.LG·May 30, 2025

TL;DR

This paper introduces CFGRL, a framework that derives policy improvement from diffusion guidance. The policy is trained with the simplicity of supervised learning, can still improve on the policies in the data, and can operate without explicitly learning a value function.

Contribution

The paper derives a direct relation between policy improvement and guidance of diffusion models. The resulting framework, CFGRL, trains like a supervised method, and on offline RL tasks increased guidance weighting leads to increased performance.

Findings

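01 On offline RL tasks, increased guidance weighting leads to increased performance.

02 CFGRL can operate without explicitly learning a value function.

03 CFGRL generalizes simple supervised methods (e.g., goal-conditioned behavioral cloning) to further prioritize optimality, which yields performance gains.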

Abstract

At the core of reinforcement learning is the idea of learning beyond the performance in the data. However, scaling such systems has proven notoriously tricky. In contrast, techniques from generative modeling have proven remarkably scalable and are simple to train. In this work, we combine these strengths, by deriving a direct relation between policy improvement and guidance of diffusion models. The resulting framework, CFGRL, is trained with the simplicity of supervised learning, yet can further improve on the policies in the data. On offline RL tasks, we observe a reliable trend -- increased guidance weighting leads to increased performance. Of particular importance, CFGRL can operate without explicitly learning a value function, allowing us to generalize simple supervised methods (e.g., goal-conditioned behavioral cloning) to further prioritize optimality, gaining performance for…
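The relation between guidance weighting and policy improvement described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the paper's implementation: it assumes a one-dimensional action, unit-variance Gaussian policies, and the standard classifier-free guidance combination of scores. All names and constants (`PRIOR_MEAN`, `OPTIMAL_MEAN`, `guided_score`, `guided_mode`) are hypothetical.

```python
# Toy 1-D illustration of classifier-free guidance applied to a policy.
# Assumptions (not from the paper): unit-variance Gaussians, hypothetical names.

PRIOR_MEAN = 0.0      # mean of the unconditional (dataset) policy
OPTIMAL_MEAN = 1.0    # mean of the optimality-conditioned policy


def gaussian_score(a: float, mean: float) -> float:
    """Score d/da log N(a; mean, 1) of a unit-variance Gaussian."""
    return -(a - mean)


def guided_score(a: float, w: float) -> float:
    """Unconditional score plus w times the (conditional - unconditional) offset."""
    uncond = gaussian_score(a, PRIOR_MEAN)
    cond = gaussian_score(a, OPTIMAL_MEAN)
    return uncond + w * (cond - uncond)


def guided_mode(w: float, steps: int = 200, lr: float = 0.1) -> float:
    """Follow the guided score from a = 0 until it settles at the mode."""
    a = 0.0
    for _ in range(steps):
        a += lr * guided_score(a, w)
    return a


if __name__ == "__main__":
    for w in (0.0, 1.0, 2.0, 4.0):
        print(f"w={w:.1f}  guided mode = {guided_mode(w):.3f}")
```

In this toy setting the guided score simplifies to `w - a`, so the guided policy is a unit-variance Gaussian centered at `w`: `w = 0` recovers the dataset policy, `w = 1` recovers the optimality-conditioned policy, and `w > 1` extrapolates beyond it.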

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 2

Strengths

The central insight—viewing policy improvement as classifier-free guidance over an advantage-conditioned policy—is elegant. It unifies guided diffusion sampling with KL-regularized policy improvement and control-as-inference via a clean product-policy view, and shows that test-time guidance directly tunes the improvement strength. The theory is tidy. The paper also avoids learning an explicit optimality predictor via a Bayes inversion that merges unconditional and optimality-conditioned polici
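The product-policy view and the Bayes inversion mentioned in this review can be written out in a few lines. This is a reconstruction under assumed notation ($\hat\pi$ for the dataset policy, $o$ for a binary optimality variable, $w$ for the guidance weight), not a quotation from the paper.

```latex
\begin{align}
\pi_w(a \mid s) &\propto \hat\pi(a \mid s)\, p(o \mid s, a)^{w}
  && \text{(product policy)} \\
p(o \mid s, a) &= \frac{\hat\pi(a \mid s, o)\, p(o \mid s)}{\hat\pi(a \mid s)}
  && \text{(Bayes' rule)} \\
\nabla_a \log \pi_w(a \mid s) &= \nabla_a \log \hat\pi(a \mid s)
  + w \left[ \nabla_a \log \hat\pi(a \mid s, o) - \nabla_a \log \hat\pi(a \mid s) \right]
\end{align}
```

The factor $p(o \mid s)^{w}$ does not depend on $a$, so it drops out under the gradient. The last line is the classifier-free guidance combination of an unconditional and an optimality-conditioned policy, which is why no separate optimality predictor has to be learned.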

Weaknesses

The paper notes that larger $w$ both improves $A_{\hat{\pi}}$ and deviates more from the dataset policy, possibly hurting performance; the ablation indeed shows performance sometimes declines past a point, but there’s no adaptive or trust-region control of $w$ or measured KL to the prior. For the offline RL part, results are averaged over four seeds; gains are consistent but sometimes modest. The GCBC part uses more seeds, but a wider set of domains and stronger end-to-end RL baselines would f
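To make the trade-off raised here concrete, the sketch below measures both quantities on a toy problem: the expected advantage of the guided policy and its KL divergence to the dataset policy, as a function of $w$. It is an illustration only and not part of the paper. The four-action problem, the advantage values, and the sigmoid optimality likelihood are all assumptions, and every function name is hypothetical.

```python
import math

# Hypothetical toy problem: four discrete actions in a single state.
PRIOR = [0.4, 0.3, 0.2, 0.1]          # dataset policy (assumed values)
ADVANTAGE = [-1.0, 0.0, 0.5, 2.0]     # assumed exact advantages per action


def optimality_likelihood(adv: float) -> float:
    """Assumed p(o | s, a): a sigmoid, one possible monotone function of the advantage."""
    return 1.0 / (1.0 + math.exp(-adv))


def product_policy(w: float) -> list[float]:
    """Guided policy proportional to prior(a) * p(o | s, a) ** w."""
    unnorm = [p * optimality_likelihood(adv) ** w for p, adv in zip(PRIOR, ADVANTAGE)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def expected_advantage(policy: list[float]) -> float:
    """Expected advantage of the actions sampled from `policy`."""
    return sum(p * adv for p, adv in zip(policy, ADVANTAGE))


def kl_to_prior(policy: list[float]) -> float:
    """KL(policy || PRIOR), in nats."""
    return sum(p * math.log(p / q) for p, q in zip(policy, PRIOR) if p > 0.0)


def sweep(weights):
    """Return (w, expected advantage, KL to prior) for each guidance weight."""
    rows = []
    for w in weights:
        pi_w = product_policy(w)
        rows.append((w, expected_advantage(pi_w), kl_to_prior(pi_w)))
    return rows


if __name__ == "__main__":
    for w, adv, kl in sweep([0.0, 1.0, 2.0, 4.0, 8.0]):
        print(f"w={w:.1f}  E[A]={adv:.3f}  KL={kl:.3f}")
```

Both columns grow with `w` in this toy. Because the advantages here are exact, the example cannot show the performance decline at large `w` that the review describes; it only shows how the deviation from the dataset policy could be reported alongside the improvement.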

Reviewer 02·Rating 4·Confidence 3

Strengths

1. Simplicity and practical appeal – The method requires only standard diffusion training and allows tuning the improvement strength $w$ at inference, offering a practical way to control policy quality without retraining.
2. Solid empirical demonstration – Results on offline RL and goal-conditioned control tasks consistently show improvements over strong baselines such as AWR and GCBC.
3. Readable and well-presented – The paper is clearly written, with theoretical and empirical sections well b

Weaknesses

1. Limited novelty beyond reinterpretation – The core idea, recasting classifier-free guidance as a policy improvement operator, is conceptually elegant but incremental. The method mainly replaces the continuous classifier (score function) in diffusion guidance with a discrete optimality variable, which is a small modification rather than a fundamentally new algorithmic contribution. Much of the theoretical framing follows directly from existing formulations of advantage-weighted regression and c

Reviewer 03·Rating 4·Confidence 3

Strengths

1. The paper is well written, and the presentation of results is clear and easy for readers to follow.
2. To the best of my knowledge, this paper is the first to theoretically establish and prove the connection between classifier-free guided diffusion policy sampling and the policy improvement operator in RL.
3. The authors’ analysis of AWR’s weakness in Section 5, together with the experimental observation that CFGRL can sustain larger guidance weights than AWR, constitutes an interesting res

Weaknesses

1. The main limitation of this paper is that most of its ideas have already appeared independently in prior work. For example, the relationship between classifier-free guidance and weighted regression has been discussed in [1], while the use of classifier-free guidance for policy improvement and the adjustment of different guidance weights was explored in [2]. Although the authors argue that [2] focuses on generating future state sequences whereas CFGRL generates single-step actions, I con

Code & Models

Repositories

kvfrans/cfgrl
JAX·Official

Taxonomy

Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Domain Adaptation and Few-Shot Learning

Methods: Diffusion