SiMPO: Measure Matching for Online Diffusion Reinforcement Learning
Haitong Ma, Chenxiao Gao, Tianyi Chen, Na Li, Bo Dai

TL;DR
SiMPO introduces a unified measure matching framework for diffusion RL that incorporates negative reweighting, leading to improved policy optimization and performance.
Contribution
It generalizes diffusion RL with a measure matching approach that allows for signed measures and negative reweighting, providing theoretical insights and practical benefits.
Findings
SiMPO outperforms existing methods in empirical evaluations.
Negative reweighting helps avoid suboptimal actions.
The framework offers flexible reweighting schemes tailored to reward landscapes.
Abstract
A commonly used family of RL algorithms for diffusion policies conducts softmax reweighting over the behavior policy, which usually induces an over-greedy policy and fails to leverage feedback from negative samples. In this work, we introduce Signed Measure Policy Optimization (SiMPO), a simple and unified framework that generalizes the reweighting scheme in diffusion RL to general monotonic functions. SiMPO revisits diffusion RL via a two-stage measure matching lens. First, we construct a virtual target policy by f-divergence regularized policy optimization, where we can relax the non-negativity constraint to allow for a signed target measure. Second, we use this signed measure to guide diffusion or flow models through reweighted matching. This formulation offers two key advantages: a) it generalizes to arbitrary monotonically increasing weighting functions; and b) it provides a…
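To make the contrast in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It compares softmax reweighting, which always yields positive weights, with one hypothetical signed scheme (advantage minus a mean baseline) that is monotonically increasing and can assign negative weight to poor samples. The function names, the linear weighting, the baseline choice, and the placeholder per-sample losses are all assumptions made for illustration.

```python
import math


def softmax_weights(advantages, temperature=1.0):
    """Softmax reweighting: every weight is strictly positive and they sum to 1."""
    m = max(advantages)  # subtract the max for numerical stability
    exps = [math.exp((a - m) / temperature) for a in advantages]
    total = sum(exps)
    return [e / total for e in exps]


def signed_linear_weights(advantages, baseline=None):
    """Hypothetical signed scheme (an assumption, not taken from the paper):
    weight = advantage - baseline. It is monotonically increasing in the
    advantage and negative for samples below the baseline."""
    if baseline is None:
        baseline = sum(advantages) / len(advantages)
    return [a - baseline for a in advantages]


def reweighted_loss(weights, per_sample_losses):
    """Weighted average of per-sample matching losses.
    With signed weights, negative entries push the model away from those samples."""
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / len(weights)


if __name__ == "__main__":
    advantages = [-1.0, 0.0, 2.0]   # illustrative values only
    losses = [0.5, 0.5, 0.5]        # placeholder per-sample matching losses
    print("softmax:", softmax_weights(advantages))
    print("signed :", signed_linear_weights(advantages))
    print("loss   :", reweighted_loss(signed_linear_weights(advantages), losses))
```

In this toy setting the softmax weights concentrate on the highest-advantage sample but never go below zero, while the signed weights are negative for the two below-average samples. That is the kind of negative feedback the abstract says softmax reweighting cannot use.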
Taxonomy
Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Adaptive Dynamic Programming Control
