Dichotomous Diffusion Policy Optimization
Ruiming Liang, Yinan Zheng, Kexin Zheng, Tianyi Tan, Jianxiong Li, Liyuan Mao, Zhihao Wang, Guang Chen, Hangjun Ye, Jingjing Liu, Jinqiao Wang, Xianyuan Zhan

TL;DR
This paper introduces DIPOLE, a reinforcement learning algorithm that stabilizes diffusion policy training by decomposing the policy into a reward-maximizing component and a reward-minimizing component, enabling controllable reward trade-offs at inference and application to a large-scale autonomous driving model.
Contribution
DIPOLE proposes a dichotomous policy decomposition and a greedified regularization scheme for stable diffusion policy optimization in reinforcement learning.
Findings
Effective in offline and offline-to-online RL tasks
Enables controllable reward trade-offs during inference
Successfully applied to an autonomous driving benchmark
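The controllable trade-off at inference can be pictured with a classifier-free-guidance (CFG) style combination of the two policies' noise predictions. The sketch below is only an illustration under that assumption: the function name `guided_noise`, the guidance scale `omega`, and the specific linear form `(1 + omega) * eps_pos - omega * eps_neg` are borrowed from standard CFG and are not confirmed as DIPOLE's exact rule.

```python
def guided_noise(eps_pos, eps_neg, omega):
    """CFG-style mix of two noise predictions (hypothetical helper).

    eps_pos: noise predicted by the reward-maximizing policy
    eps_neg: noise predicted by the reward-minimizing policy
    omega:   guidance scale; larger values push further away from eps_neg
    """
    if len(eps_pos) != len(eps_neg):
        raise ValueError("noise predictions must have the same length")
    return [(1.0 + omega) * p - omega * n for p, n in zip(eps_pos, eps_neg)]


# Made-up two-dimensional predictions, for illustration only.
eps_pos = [0.2, -0.1]
eps_neg = [0.6, 0.3]

# omega = 0 returns eps_pos unchanged; larger omega extrapolates away from eps_neg.
for omega in (0.0, 1.0, 3.0):
    print(omega, guided_noise(eps_pos, eps_neg, omega))
```

Under this reading, `omega` is a single inference-time knob: no retraining is needed to move between a conservative and a greedier policy.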
Abstract
Diffusion-based policies have gained growing popularity in solving a wide range of decision-making tasks due to their superior expressiveness and controllable generation during inference. However, effectively training large diffusion policies using reinforcement learning (RL) remains challenging. Existing methods either suffer from unstable training due to directly maximizing value objectives, or face computational issues due to relying on a crude Gaussian likelihood approximation, which requires a large number of sufficiently small denoising steps. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm designed for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which offers a desirable weighted regression objective for diffusion policy extraction, but often struggles to balance…
Peer Reviews
Decision: ICLR 2026 Poster
1. The proposed dichotomous decomposition of the KL-regularized objective is both elegant and conceptually novel. The analogy to classifier-free guidance (CFG) provides a strong intuitive and theoretical bridge between diffusion modeling and RL optimization.
2. The paper presents comprehensive experiments across multiple RL benchmarks and an ambitious large-scale 1B-parameter VLA model for end-to-end driving, showing clear improvements over strong baselines (IQL, FQL, CFGRL, etc.).
3. The pape

1. The reviewer is a little bit confused about why we need to train a policy that minimizes the rewards. In my opinion, to avoid the large difference between the optimized policy and the behavior policy of offline data, we can directly perform imitation learning on the second diffusion policy rather than minimizing the reward.
2. How can we get $G(s, a)$ in the proposed method? Should we apply some special technique to learn it, such as CQL [R2]?
3. The method can be classified as a weighted-ba

* The paper proposes a simple but effective method (DIPOLE) that trains two diffusion policies instead of one, helping stabilize learning in offline RL.
* The method is well-motivated and theoretically justified, avoiding unstable exponential weighting by using bounded scores.
* Strong experimental results across many tasks, including large-scale vision-language-action models for autonomous driving.
* The paper is clearly written, well-organized, and easy to follow.

* The method is only evaluated in offline or offline-to-online settings. I am not sure why the same idea can't be applied to online RL?
* The baselines for comparison seem random to me. Not sure what are the reasons to choose those baselines as opposed to some other diffusion-based / non-diffusion-based offline RL baselines. For example, there are plenty of model-based offline RL baselines and I think the authors primarily only choose model-free baselines. Is this intentional? What are the ratio
This work derives a new closed-form optimal policy under a modified KL objective with a bounded sigmoid weighting, effectively avoiding unstable exponential terms and preventing gradient explosions. The dual-policy decomposition enables learning from both high- and low-reward samples, mitigating data imbalance and overfitting to rare high-reward trajectories. Empirical results demonstrate that DIPOLE consistently outperforms strong baselines across offline, offline-to-online, and large-scale VLA
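The contrast this review draws between exponential and bounded sigmoid weighting can be checked numerically. The following is a minimal sketch, not the authors' code: `exp_weight` stands in for the usual exponential weight of KL-regularized weighted regression, `dichotomous_weights` returns the pair $\sigma(\beta G)$ and $1 - \sigma(\beta G)$, and the values of $G$ and $\beta$ are made up for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def exp_weight(g, beta):
    """Exponential regression weight exp(beta * G); unbounded above."""
    return math.exp(beta * g)


def dichotomous_weights(g, beta):
    """Bounded pair of weights: (sigma(beta * G), 1 - sigma(beta * G))."""
    w_pos = sigmoid(beta * g)
    return w_pos, 1.0 - w_pos


beta = 3.0  # illustrative temperature, not taken from the paper
for g in (-5.0, 0.0, 5.0, 50.0):  # made-up scores G(s, a)
    w_pos, w_neg = dichotomous_weights(g, beta)
    print(f"G={g:6.1f}  exp={exp_weight(g, beta):.3e}  w_pos={w_pos:.4f}  w_neg={w_neg:.4f}")
```

The exponential weight spans many orders of magnitude across these samples, so a few high-score actions dominate the loss, while both sigmoid weights stay in $[0, 1]$ and sum to one for every sample.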
1. In DIPOLE, the weight term $\sigma(\beta G(s,a))$ (or $1 - \sigma(\beta G(s,a))$) is treated as constant with respect to the diffusion model parameters. This means that the diffusion model learns to denoise under static weighting but does not explicitly learn how to adjust the action distribution to improve $G(s,a)$ directly. Consequently, there is no gradient signal guiding the modification of intermediate noisy actions to increase the expected reward, which may lead to slower convergence or
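The point about constant weights can be made concrete with a toy weighted denoising loss. This is a hedged sketch under simplifying assumptions: noise targets and predictions are scalars per sample, `dichotomous_losses` is a hypothetical helper, and the weight is computed from a given score $G(s,a)$ without any dependence on model parameters, which is the situation the reviewer describes.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dichotomous_losses(pred_pos, pred_neg, noise, g_values, beta):
    """Mean weighted squared denoising errors for the two policies.

    pred_pos / pred_neg: per-sample noise predictions of the two models
    noise:               per-sample target noise
    g_values:            per-sample scores G(s, a), treated as fixed inputs
    """
    n = len(noise)
    loss_pos = 0.0
    loss_neg = 0.0
    for pp, pn, eps, g in zip(pred_pos, pred_neg, noise, g_values):
        # The weight depends only on the data sample, not on the predictions,
        # so it rescales each sample's denoising error and nothing more.
        w = sigmoid(beta * g)
        loss_pos += w * (pp - eps) ** 2
        loss_neg += (1.0 - w) * (pn - eps) ** 2
    return loss_pos / n, loss_neg / n


# Made-up batch: the first sample has a high score, the second a low one.
noise = [1.0, -1.0]
pred = [0.0, 0.0]
print(dichotomous_losses(pred, pred, noise, g_values=[4.0, -4.0], beta=1.0))
```

In an autodiff framework the same effect is usually obtained by wrapping the weight in a stop-gradient (for example `jax.lax.stop_gradient` in JAX or `Tensor.detach()` in PyTorch), so gradients reach the denoiser only through the squared error term.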
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Autonomous Vehicle Technology and Safety
