TL;DR
Q-learning with Adjoint Matching (QAM) is a stable, unbiased method for optimizing expressive diffusion policies in continuous-action reinforcement learning, and it outperforms prior approaches on sparse-reward tasks.
Contribution
QAM uses adjoint matching to enable stable, gradient-based optimization of flow and diffusion policies, avoiding the numerical instability of backpropagating through their multi-step denoising process.
Findings
QAM outperforms prior methods on sparse-reward tasks
QAM provides unbiased, expressive policies at the optimum
QAM is effective in both offline and offline-to-online RL settings
Abstract
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which…
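To make the phrase "backpropagation through their multi-step denoising process" concrete, here is a minimal toy sketch of that naive approach, not of QAM or adjoint matching. Everything in it is an assumption for illustration: the diagonal linear velocity field (`rates`), the quadratic critic, and all function names are hypothetical and do not come from the paper. It samples an action with Euler steps, takes the critic's gradient at that action, and pulls it back to the initial noise by applying the per-step Jacobian once per denoising step.

```python
def euler_sample(noise, rates, num_steps):
    """Integrate da_i/dt = rates[i] * a_i from t=0 to t=1 with Euler steps."""
    h = 1.0 / num_steps
    action = list(noise)
    for _ in range(num_steps):
        action = [a + h * r * a for a, r in zip(action, rates)]
    return action


def critic(action, target):
    """Toy critic Q(a) = -0.5 * ||a - target||^2 (hypothetical, for illustration)."""
    return -0.5 * sum((a - t) ** 2 for a, t in zip(action, target))


def critic_grad(action, target):
    """First-order information of the toy critic: dQ/da = -(a - target)."""
    return [-(a - t) for a, t in zip(action, target)]


def backprop_to_noise(grad_action, rates, num_steps):
    """Chain rule through the sampler.

    Each Euler step a_i -> a_i + h * rates[i] * a_i has the diagonal
    Jacobian (1 + h * rates[i]), so the gradient is multiplied by that
    factor once per step.
    """
    h = 1.0 / num_steps
    grad = list(grad_action)
    for _ in range(num_steps):
        grad = [g * (1.0 + h * r) for g, r in zip(grad, rates)]
    return grad


# Hypothetical toy setup: 2-D action, one expanding and one contracting direction.
rates = [3.0, -3.0]
target = [1.0, 1.0]
noise = [0.5, 0.5]
num_steps = 10

action = euler_sample(noise, rates, num_steps)
grad_action = critic_grad(action, target)
grad_noise = backprop_to_noise(grad_action, rates, num_steps)
print("dQ/da at the sampled action:", grad_action)
print("dQ/dnoise after backprop:   ", grad_noise)
```

In this toy setup the backpropagated gradient is the critic gradient scaled by 1.3^10 in the first coordinate and 0.7^10 in the second, so a product of per-step Jacobians can grow or shrink geometrically with the number of denoising steps. This is one common source of ill-conditioning in such computations; it is not necessarily the specific failure mode the paper analyzes.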
