Mirror Descent Policy Optimization

Manan Tomar; Lior Shani; Yonathan Efroni; Mohammad Ghavamzadeh

arXiv:2005.09814 · cs.LG · June 8, 2021 · 24 citations

TL;DR

This paper introduces Mirror Descent Policy Optimization (MDPO), an efficient RL algorithm derived from mirror descent. MDPO relates TRPO, PPO, and SAC to one another under a common framework and matches or outperforms them on continuous control tasks.

Contribution

The paper proposes MDPO, an RL algorithm based on mirror descent principles, in on-policy and off-policy variants, and uses it to connect and improve upon existing trust-region algorithms.

Findings

1. MDPO performs better than or on par with TRPO, PPO, and SAC.

2. Explicit trust-region constraints are not essential for high performance.

3. MDPO unifies several popular RL algorithms under a common framework.

Abstract

Mirror descent (MD), a well-known first-order method in constrained convex optimization, has recently been shown to be an important tool for analyzing trust-region algorithms in reinforcement learning (RL). However, there remains a considerable gap between such theoretically analyzed algorithms and the ones used in practice. Inspired by this, we propose an efficient RL algorithm, called mirror descent policy optimization (MDPO). MDPO iteratively updates the policy by approximately solving a trust-region problem, whose objective function consists of two terms: a linearization of the standard RL objective and a proximity term that restricts two consecutive policies to be close to each other. Each update performs this approximation by taking multiple gradient steps on this objective function. We derive on-policy and off-policy variants of MDPO, while emphasizing important…
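The update described in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: it assumes a single state with a handful of discrete actions, a softmax policy over logits, known advantages, and a KL divergence as the proximity term (the abstract only says "proximity term"). The names `mdpo_update`, `mdpo_objective`, `step_size`, `num_grad_steps`, and `lr` are hypothetical. Each update takes several gradient steps on the objective "expected advantage minus KL to the previous policy, scaled by 1/step_size" instead of solving the trust-region problem exactly.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def mdpo_objective(logits, old_probs, advantages, step_size):
    """Linearized return minus a KL proximity term (toy, single state).

    Assumption: the proximity term is KL(pi_new || pi_old) / step_size.
    """
    probs = softmax(logits)
    kl = np.sum(probs * (np.log(probs) - np.log(old_probs)))
    return probs @ advantages - kl / step_size


def mdpo_update(old_logits, advantages, step_size, num_grad_steps=10, lr=0.5):
    """Approximately solve one trust-region problem by gradient ascent.

    The previous policy stays fixed while the logits of the new policy
    take `num_grad_steps` ascent steps on `mdpo_objective`.
    """
    old_probs = softmax(old_logits)
    logits = old_logits.copy()
    for _ in range(num_grad_steps):
        probs = softmax(logits)
        # Per-action "regularized advantage" under the current iterate.
        f = advantages - (np.log(probs) - np.log(old_probs)) / step_size
        # Exact gradient of the objective with respect to softmax logits.
        grad = probs * (f - probs @ f)
        logits = logits + lr * grad
    return logits


if __name__ == "__main__":
    old_logits = np.zeros(3)                 # uniform previous policy
    advantages = np.array([1.0, 0.0, -1.0])  # made-up advantage values
    new_logits = mdpo_update(old_logits, advantages, step_size=1.0)
    print("previous policy:", softmax(old_logits))
    print("updated policy: ", softmax(new_logits))
```

In this toy setting the exact solution of the regularized problem is proportional to the previous policy times `exp(step_size * advantages)`, so running many gradient steps should approach `softmax(old_logits + step_size * advantages)`, while a few steps give the approximate update the abstract refers to.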

Code & Models

Repositories

manantomar/Mirror-Descent-Policy-Optimization (tf, official)

Videos

Mirror Descent Policy Optimization · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Scheduling and Optimization Algorithms · Optimization and Search Problems

Methods: Mirror Descent Policy Optimization · Entropy Regularization · Trust Region Policy Optimization · Proximal Policy Optimization