Diffusion Policy through Conditional Proximal Policy Optimization

Ben Liu; Shunpeng Yang; Hua Chen

arXiv:2603.04790·cs.LG·March 6, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a novel on-policy reinforcement learning method that efficiently trains diffusion policies by simplifying likelihood evaluation, enabling diverse behaviors and improved performance on benchmark tasks.

Contribution

It presents a new approach to train diffusion policies efficiently in an on-policy setting by aligning policy iteration with the diffusion process, simplifying likelihood computation.

Findings

1. Produces multimodal policy behaviors
2. Achieves superior performance on benchmark tasks
3. Handles entropy regularization naturally

Abstract

Reinforcement learning (RL) has been extensively employed in a wide range of decision-making problems, such as games and robotics. Recently, diffusion policies have shown strong potential in modeling multi-modal behaviors, enabling more diverse and flexible action generation compared to the conventional Gaussian policy. Despite various attempts to combine RL with diffusion, a key challenge remains: the action log-likelihood is difficult to compute under the diffusion model. This greatly hinders the direct application of diffusion policies in on-policy reinforcement learning. Most existing methods calculate or approximate the log-likelihood through the entire denoising process in the diffusion model, which can be inefficient in both memory and computation. To overcome this challenge, we propose a novel and efficient method to train a diffusion policy in an on-policy setting that requires only…
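
To make the contrast in the abstract concrete, the sketch below shows why a Gaussian policy is convenient for on-policy RL: its log-likelihood has a closed form, so the PPO probability ratio is cheap to evaluate. This is the standard PPO clipped surrogate for a single sample, not necessarily the exact objective used in the paper; the function names and the `clip_eps` default are illustrative assumptions.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Closed-form log-density of a diagonal Gaussian, summed over action dims."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )


def ppo_clip_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one (state, action) sample.

    clip_eps is an assumed default, not a value taken from the paper.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)


# Example: evaluate one sample under an old and a slightly shifted new policy.
action = [0.3, -0.1]
logp_old = gaussian_log_prob(action, mean=[0.0, 0.0], std=[1.0, 1.0])
logp_new = gaussian_log_prob(action, mean=[0.1, 0.0], std=[1.0, 1.0])
surrogate = ppo_clip_term(logp_new, logp_old, advantage=1.0)
```

A diffusion policy has no such one-line density: per the abstract, existing methods obtain the log-likelihood through the entire denoising process, which is the cost the paper aims to avoid.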

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The method is simple yet novel: it separates the Flow Matching stage from reinforcement-learning policy improvement, by first training a Teacher policy under a familiar PPO training paradigm and then using Flow Matching to learn that Teacher policy.
2. The ablation studies are thorough: through experiments the paper convincingly shows the importance of both the entropy regularization (which allows the method to outperform FPO) and the score-based regularization (which prevents the Teacher pol

Weaknesses

1. Some experimental descriptions are insufficiently clear. The meaning of “Flow” vs. “Flow + Residual” is ambiguous. The original text only uses the phrase *“diffusion-only (denotes ‘Flow’) policy and the combined policy”* (lines 390–391) without further clarifying exactly what “Residual” constitutes.
2. Details of the training process are missing. Since the method relies on Flow Matching to fit $\pi^{k}$ after $p_{\boldsymbol{\theta}}$, the paper should report: evidence that Flow Matching

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The main strength of this paper lies in its novel and significant methodology, which cleverly bypasses the intractable $\log \pi(a|s)$ computation in on-policy diffusion training by treating each policy iteration as a denoising step. The method is computationally efficient, with GPU memory usage comparable to PPO while maintaining reasonable training times. Furthermore, it elegantly solves the difficult entropy regularization problem by optimizing a tractable entropy lower bound, a key featu

Weaknesses

The method has several weaknesses. First, it introduces a policy fitting step (Flow Matching) after the optimization step (CPPO), which creates an approximation error whose cumulative impact on convergence is unassessed. Second, the algorithm relies heavily on an EMA approximation to ensure monotonic improvement, which is not a theoretical guarantee and may fail if policy updates are too large.
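
For readers unfamiliar with the term, an exponential moving average (EMA) can be sketched as below. This is only the generic EMA update; which quantity the paper averages, and with what decay, is not stated in this review, so `ema_update`, its arguments, and the `decay` default are hypothetical.

```python
def ema_update(target_params, online_params, decay=0.99):
    """Generic EMA step: decay * target + (1 - decay) * online, element-wise.

    Illustrative only; the averaged quantity and decay in the paper are unknown here.
    """
    return [decay * t + (1.0 - decay) * o for t, o in zip(target_params, online_params)]


# Example: the averaged parameters drift slowly toward the online parameters.
target = [0.0, 0.0]
online = [1.0, -1.0]
for _ in range(100):
    target = ema_update(target, online, decay=0.9)
```

A decay close to 1 keeps each step small. The reviewer's concern is that this smoothing is an approximation rather than a guarantee, so it may not hold when the underlying policy updates are large.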

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper addresses a key and challenging problem: applying diffusion policies in on-policy reinforcement learning.
2. The proposed training framework is both efficient and elegant, as it only requires optimizing the Gaussian residual policy and training diffusion models using the simple flow matching loss, thereby avoiding the intractable computation of diffusion model log-likelihoods.
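
The "simple flow matching loss" mentioned in this review can be illustrated with a common formulation that uses a straight-line path between a noise sample and a target action. This is a minimal sketch under that assumption; the paper's exact interpolation path, conditioning on state, and network are not given here, and `flow_matching_loss` and `velocity_fn` are hypothetical names.

```python
def flow_matching_loss(velocity_fn, noise, action, t):
    """Squared error between a predicted velocity and the straight-line target.

    Assumes the linear path x_t = (1 - t) * noise + t * action, whose
    velocity is (action - noise). velocity_fn stands in for a learned model.
    """
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target = [a - n for n, a in zip(noise, action)]
    pred = velocity_fn(x_t, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(action)


# Example with toy velocity functions in place of a neural network.
noise = [0.0, 0.0]
action = [1.0, 2.0]
perfect = flow_matching_loss(lambda x, t: [1.0, 2.0], noise, action, t=0.5)
zero_model = flow_matching_loss(lambda x, t: [0.0, 0.0], noise, action, t=0.5)
```

The loss only needs samples and a regression target, so no log-likelihood of the generative model is evaluated, which is the property the reviewer highlights.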

Weaknesses

1. The paper would be strengthened by including a comparison with DPPO [1], which also employs an on-policy RL algorithm (PPO) to train diffusion policies.
2. Moving the learning curves from the appendix to the main text would provide a clearer comparison of performance against baseline methods.

Taxonomy

Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Artificial Intelligence in Games