Mirror Descent Policy Optimization
Manan Tomar, Lior Shani, Yonathan Efroni, Mohammad Ghavamzadeh

TL;DR
This paper introduces Mirror Descent Policy Optimization (MDPO), an efficient RL algorithm inspired by mirror descent, which connects to and improves upon existing algorithms such as TRPO, PPO, and SAC in continuous control tasks.
Contribution
The paper proposes MDPO, a novel RL algorithm based on mirror descent principles, with on-policy and off-policy variants, connecting and enhancing existing trust-region algorithms.
Findings
MDPO performs better than or comparably to TRPO, PPO, and SAC.
Explicit trust-region constraints are not essential for high performance.
MDPO unifies several popular RL algorithms under a common framework.
Abstract
Mirror descent (MD), a well-known first-order method in constrained convex optimization, has recently been shown as an important tool to analyze trust-region algorithms in reinforcement learning (RL). However, there remains a considerable gap between such theoretically analyzed algorithms and the ones used in practice. Inspired by this, we propose an efficient RL algorithm, called mirror descent policy optimization (MDPO). MDPO iteratively updates the policy by approximately solving a trust-region problem, whose objective function consists of two terms: a linearization of the standard RL objective and a proximity term that restricts two consecutive policies to be close to each other. Each update performs this approximation by taking multiple gradient steps on this objective function. We derive on-policy and off-policy variants of MDPO, while emphasizing important…
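The update described in the abstract can be illustrated with a minimal tabular sketch. It is not the authors' implementation: it assumes a single state with three discrete actions, a softmax policy, known advantages, and a KL divergence as the proximity term. The names `mdpo_update`, `step_size`, `num_grad_steps`, and `lr` are hypothetical. The objective is the linearized return (expected advantage under the new policy) minus the KL to the previous policy divided by the step size, and it is maximized approximately with a few plain gradient steps.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL(p || q) for strictly positive discrete distributions.
    return float(np.sum(p * (np.log(p) - np.log(q))))

def mdpo_objective(pi, pi_old, adv, step_size):
    # Linearized objective minus the proximity term (KL is an assumption here).
    return float(pi @ adv) - kl(pi, pi_old) / step_size

def mdpo_update(logits_old, adv, step_size=1.0, num_grad_steps=10, lr=0.5):
    # Approximately maximize the objective by gradient ascent on the logits,
    # starting from the previous policy.
    pi_old = softmax(logits_old)
    logits = logits_old.astype(float).copy()
    for _ in range(num_grad_steps):
        pi = softmax(logits)
        # Gradient of the objective with respect to the probabilities.
        g = adv - (np.log(pi) - np.log(pi_old) + 1.0) / step_size
        # Chain rule through the softmax parameterization.
        logits += lr * pi * (g - pi @ g)
    return softmax(logits)

logits_old = np.zeros(3)            # uniform previous policy
adv = np.array([1.0, 0.0, -1.0])    # illustrative advantages, not real data

pi_few = mdpo_update(logits_old, adv, num_grad_steps=10)
pi_many = mdpo_update(logits_old, adv, num_grad_steps=500)

# Exact maximizer of this tabular problem: pi_old * exp(step_size * adv), normalized.
pi_exact = softmax(logits_old + 1.0 * adv)
print(pi_few, pi_many, pi_exact)
```

In this toy setting the exact solution is available in closed form, so the gradient steps are only there to mirror the "approximately solving" idea; with function approximation and sampled advantages no such closed form is available, which is where taking multiple gradient steps per update becomes the practical choice.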
Taxonomy
Topics: Reinforcement Learning in Robotics · Scheduling and Optimization Algorithms · Optimization and Search Problems
Methods: Mirror Descent Policy Optimization · Entropy Regularization · Trust Region Policy Optimization · Proximal Policy Optimization
