Wasserstein Policy Optimization

David Pfau, Ian Davies, Diana Borsa, Joao G. M. Araujo, Brendan Tracey, and Hado van Hasselt

arXiv:2505.00663 · cs.LG · May 2, 2025

TL;DR

Wasserstein Policy Optimization (WPO) is an actor-critic reinforcement learning algorithm for continuous action spaces. Its policy update is derived from Wasserstein gradient flow and combines properties of deterministic and stochastic policy gradient methods.

Contribution

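The paper introduces WPO, an actor-critic algorithm derived as an approximation to Wasserstein gradient flow over the space of policies, projected into a finite-dimensional parameter space such as the weights of a neural network. The result is a simple, general, closed-form policy update for continuous control tasks.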

Findings

1. WPO performs favorably on DeepMind Control Suite benchmarks.

2. WPO demonstrates effective learning in a magnetic confinement fusion task.

3. WPO combines benefits of deterministic and stochastic policy gradient methods.

Abstract

We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions -- without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with…
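The abstract contrasts two ingredients: the action-gradient of the critic (as in deterministic policy gradients) and a stochastic policy sampled without the reparameterization trick (as in classic policy gradients). The sketch below illustrates that contrast on a toy one-dimensional Gaussian policy. It is not taken from the paper. The quadratic critic, the function names, and in particular the assumed update form E[∇_θ(∇_a log π(a) · ∇_a Q(a))] with the sampled action held fixed are illustrative assumptions that should be checked against the paper itself; any preconditioning or other details the authors use are omitted.

```python
import random

TARGET = 1.0  # hypothetical location of the best action

def q_value(a):
    # Toy critic (an assumption for illustration): Q(a) = -(a - TARGET)^2.
    return -(a - TARGET) ** 2

def dq_da(a):
    # Analytic gradient of the toy critic with respect to the action.
    return -2.0 * (a - TARGET)

def score_function_grad(mu, sigma, actions):
    # Classic policy gradient estimate: mean of grad_theta log pi(a) * Q(a).
    g_mu = g_sigma = 0.0
    for a in actions:
        z = (a - mu) / sigma
        g_mu += (z / sigma) * q_value(a)              # d log pi / d mu
        g_sigma += ((z * z - 1.0) / sigma) * q_value(a)  # d log pi / d sigma
    n = len(actions)
    return g_mu / n, g_sigma / n

def action_gradient_update(mu, sigma, actions):
    # Assumed form: mean of grad_theta(grad_a log pi(a)) * dQ/da, a held fixed.
    # For a Gaussian, grad_a log pi(a) = -(a - mu) / sigma^2, hence
    #   d/d mu    of that term is 1 / sigma^2
    #   d/d sigma of that term is 2 (a - mu) / sigma^3
    g_mu = g_sigma = 0.0
    for a in actions:
        g_mu += dq_da(a) / sigma ** 2
        g_sigma += 2.0 * (a - mu) / sigma ** 3 * dq_da(a)
    n = len(actions)
    return g_mu / n, g_sigma / n

rng = random.Random(0)
mu, sigma = 0.0, 0.5
# Actions are sampled directly from the policy; nothing is differentiated
# through the sampling step, so no reparameterization is involved.
actions = [rng.gauss(mu, sigma) for _ in range(20000)]

pg = score_function_grad(mu, sigma, actions)
ag = action_gradient_update(mu, sigma, actions)
print("score-function direction (mu, sigma):", pg)
print("action-gradient direction (mu, sigma):", ag)
```

In this toy setting both directions raise the mean toward the target and shrink the standard deviation. The first uses only critic values, while the second uses only the critic's gradient with respect to the action.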

Videos

Wasserstein Policy Optimization · SlidesLive