Efficient Online Reinforcement Learning for Diffusion Policy
Haitong Ma, Tianyi Chen, Kai Wang, Na Li, Bo Dai

TL;DR
This paper introduces Reweighted Score Matching, a method for efficient online reinforcement learning with diffusion policies that avoids the need to sample from the target policy and improves performance on MuJoCo benchmarks.
Contribution
The paper proposes Reweighted Score Matching for diffusion policies, enabling scalable online RL without sampling from the target distribution, and introduces two algorithms, DPMD and SDAC.
Findings
DPMD achieves an improvement of more than 120% over soft actor-critic on the Humanoid and Ant tasks.
Proposed algorithms outperform recent diffusion-policy online RL methods.
Reweighted Score Matching reduces computational costs and stabilizes training.
Abstract
Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) due to their rich expressiveness. However, the conventional diffusion training procedure requires samples from the target distribution, which is impossible in online RL since we cannot sample from the optimal policy. Backpropagating the policy gradient through the diffusion process incurs huge computational costs and instability, making it expensive and not scalable. To enable efficient training of diffusion policies in online RL, we generalize the conventional denoising score matching by reweighting the loss function. The resulting Reweighted Score Matching (RSM) preserves the optimal solution and low computational cost of denoising score matching, while eliminating the need to sample from the target distribution and allowing learning to optimize value functions. We introduce…
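The reweighting idea in the abstract can be illustrated with a minimal sketch. It shows a standard noise-prediction (denoising score matching) loss and a per-sample reweighted variant of it. This is not the paper's RSM objective: the exponential weighting, the critic values `q_values`, the fixed `alpha_bar`, and the zero "network output" are all hypothetical stand-ins chosen only to show how a weight can enter the loss while the actions still come from a distribution that can be sampled.

```python
import numpy as np

def perturb(a0, eps, alpha_bar):
    """Forward diffusion step: mix clean actions a0 with Gaussian noise eps."""
    return np.sqrt(alpha_bar) * a0 + np.sqrt(1.0 - alpha_bar) * eps

def denoising_loss(eps_pred, eps, weights=None):
    """Squared noise-prediction error, optionally reweighted per sample.

    With weights=None this is the usual unweighted mean. With weights it is
    a self-normalised weighted mean (an assumption made for this sketch).
    """
    per_sample = np.sum((eps_pred - eps) ** 2, axis=-1)
    if weights is None:
        return float(np.mean(per_sample))
    w = np.asarray(weights, dtype=float)
    w = w / np.sum(w)
    return float(np.sum(w * per_sample))

rng = np.random.default_rng(0)

# Actions drawn from a distribution we CAN sample (e.g. a replay buffer),
# not from the unknown optimal policy.
a0 = rng.normal(size=(4, 2))
eps = rng.normal(size=a0.shape)
a_t = perturb(a0, eps, alpha_bar=0.5)

# Stand-in for a noise-prediction network evaluated at a_t (hypothetical).
eps_pred = np.zeros_like(eps)

# Hypothetical critic values; higher-value actions get larger weight.
q_values = np.array([0.0, 1.0, 2.0, 3.0])
weights = np.exp(q_values - q_values.max())

uniform_loss = denoising_loss(eps_pred, eps)
reweighted_loss = denoising_loss(eps_pred, eps, weights)
```

With uniform weights the reweighted loss reduces to the ordinary denoising loss, which is the sense in which a reweighted objective generalizes the conventional one in this sketch.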
Taxonomy
Topics: Neural Networks Stability and Synchronization
