Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

Jun Song; Niao He; Lijun Ding; Chaoyue Zhao

arXiv:2306.14133 · cs.LG · June 27, 2023 · 2 cites

TL;DR

This paper introduces Wasserstein and Sinkhorn trust region methods for policy optimization in reinforcement learning, providing theoretical guarantees of convergence and demonstrating improved performance and robustness over existing methods.

Contribution

It proposes Wasserstein and Sinkhorn trust-region methods that optimize the policy distribution directly, rather than restricting it to a parametric class, and supports them with theoretical convergence guarantees and empirical validation.

Findings

01

WPO guarantees monotonic performance improvement.

02

SPO converges faster and is more sample-efficient.

03

Both methods outperform state-of-the-art policy gradient algorithms.

Abstract

Trust-region methods based on Kullback-Leibler divergence are pervasively used to stabilize policy optimization in reinforcement learning. In this paper, we exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions, namely Wasserstein policy optimization (WPO) and Sinkhorn policy optimization (SPO). Instead of restricting the policy to a parametric distribution class, we directly optimize the policy distribution and derive their closed-form policy updates based on the Lagrangian duality. Theoretically, we show that WPO guarantees a monotonic performance improvement, and SPO provably converges to WPO as the entropic regularizer diminishes. Moreover, we prove that with a decaying Lagrangian multiplier to the trust region constraint, both methods converge to global optimality. Experiments across tabular domains,…
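
The abstract's claim that SPO approaches WPO as the entropic regularizer diminishes mirrors a standard property of the underlying distances: the transport cost of the entropy-regularized (Sinkhorn) plan tends to the exact Wasserstein distance as the regularization weight goes to zero. The sketch below illustrates only that distance-level property for two hypothetical action distributions on a 1-D support; it is not the paper's closed-form WPO or SPO policy update. The function names (`wasserstein_1d`, `sinkhorn_cost`), the example distributions, the ground cost, and the `eps` values are all assumptions chosen for illustration.

```python
import numpy as np


def wasserstein_1d(p, q, support):
    """Exact 1-Wasserstein distance between two discrete distributions
    on the same sorted 1-D support, computed from the gap between their CDFs."""
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * np.diff(support)))


def sinkhorn_cost(p, q, cost, eps, n_iters=1000):
    """Transport cost <P, C> of the entropy-regularized (Sinkhorn) plan P
    with regularization weight eps, found by alternating row/column scaling."""
    K = np.exp(-cost / eps)          # Gibbs kernel of the ground cost
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iters):
        u = p / (K @ v)              # rescale rows to match marginal p
        v = q / (K.T @ u)            # rescale columns to match marginal q
    plan = u[:, None] * K * v[None, :]
    return float(np.sum(plan * cost))


# Hypothetical "old" and "new" action distributions over two actions located at 0 and 1.
support = np.array([0.0, 1.0])
p = np.array([0.7, 0.3])
q = np.array([0.4, 0.6])
cost = np.abs(support[:, None] - support[None, :])  # assumed ground cost |a - a'|

w1 = wasserstein_1d(p, q, support)
costs = [sinkhorn_cost(p, q, cost, eps) for eps in (1.0, 0.5, 0.1, 0.02)]
print(w1, costs)  # the Sinkhorn transport costs shrink toward w1 as eps decreases
```

The CDF shortcut in `wasserstein_1d` works only because the support is one-dimensional and sorted; for a general ground cost one would instead solve the unregularized transport linear program.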

Taxonomy

Topics: Reinforcement Learning in Robotics