A unified view of entropy-regularized Markov decision processes

Gergely Neu; Anders Jonsson; Vicenç Gómez

arXiv:1705.07798·cs.LG·May 23, 2017·98 cites

Open Access

TL;DR

This paper introduces a unified framework for entropy-regularized average-reward reinforcement learning in MDPs, connecting several existing algorithms through a convex-regularized linear-programming formulation and analyzing their convergence properties.

Contribution

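It extends the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization functions, formalizes several entropy-regularized algorithms as approximate variants of Mirror Descent or Dual Averaging, and analyzes their convergence properties along with the empirical effects of regularization.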

Findings

01

Exact TRPO converges to the optimal policy.

02

Entropy-regularized policy gradient methods may not converge.

03

Regularization impacts learning performance in simple RL setups.

Abstract

We propose a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs). Our approach is based on extending the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization functions. Our key result is showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations. This result enables us to formalize a number of state-of-the-art entropy-regularized reinforcement learning algorithms as approximate variants of Mirror Descent or Dual Averaging, and thus to argue about the convergence properties of these methods. In particular, we show that the exact version of the TRPO algorithm of Schulman et al. (2015) actually converges to the optimal policy, while the…
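The regularizer named in the abstract, the conditional entropy of a joint state-action distribution, can be illustrated with a minimal sketch. It assumes a finite MDP whose joint distribution is stored as a NumPy array `mu` of shape `(n_states, n_actions)`; the function names `induced_policy` and `conditional_entropy` are hypothetical and do not come from the paper. It uses the standard definition H(A|X) = -Σ μ(x,a) log(μ(x,a) / Σ_b μ(x,b)), which may differ in sign or scaling from the paper's own notation.

```python
import numpy as np


def induced_policy(mu):
    """Policy pi(a|x) = mu(x, a) / sum_b mu(x, b) induced by a joint distribution.

    Assumption: states with zero total mass get a uniform policy, since the
    ratio is undefined there.
    """
    mu = np.asarray(mu, dtype=float)
    nu = mu.sum(axis=1, keepdims=True)  # state marginal nu(x)
    uniform = np.full_like(mu, 1.0 / mu.shape[1])
    return np.divide(mu, nu, out=uniform, where=nu > 0)


def conditional_entropy(mu):
    """Conditional entropy H(A|X) of the joint state-action distribution mu."""
    mu = np.asarray(mu, dtype=float)
    pi = induced_policy(mu)
    mask = mu > 0  # terms with mu(x, a) = 0 contribute nothing
    return float(-np.sum(mu[mask] * np.log(pi[mask])))


# A uniformly random policy over two actions versus a deterministic policy.
mu_uniform = np.array([[0.25, 0.25], [0.25, 0.25]])
mu_deterministic = np.array([[0.5, 0.0], [0.0, 0.5]])
h_uniform = conditional_entropy(mu_uniform)
h_deterministic = conditional_entropy(mu_deterministic)
```

In this sketch the entropy depends only on the policy induced at each state, weighted by how often the state is visited, so a deterministic policy scores zero regardless of the state distribution.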

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Adaptive Dynamic Programming Control

Methods: Trust Region Policy Optimization