Relative Entropy Regularized Policy Iteration

Abbas Abdolmaleki; Jost Tobias Springenberg; Jonas Degrave; Steven Bohez; Yuval Tassa; Dan Belov; Nicolas Heess; Martin Riedmiller

arXiv:1812.02256 · cs.LG · December 7, 2018 · 45 cites

TL;DR

This paper introduces an off-policy actor-critic reinforcement learning algorithm that combines stochastic search with a learned action-value function, and reports state-of-the-art results on 31 continuous control tasks.

Contribution

It proposes a flexible three-step policy iteration method (policy evaluation, policy improvement, and generalization) that draws on black-box optimization and 'RL as inference', and that can be seen as an extension of either MPO or Trust Region CMA-ES.

Findings

01 Achieved state-of-the-art results on 31 continuous control tasks.

02 Demonstrated effectiveness with limited compute and a single hyperparameter set.

03 Validated the method's versatility across diverse RL environments.

Abstract

We present an off-policy actor-critic algorithm for Reinforcement Learning (RL) that combines ideas from gradient-free optimization via stochastic search with a learned action-value function. The result is a simple procedure consisting of three steps: i) policy evaluation by estimating a parametric action-value function; ii) policy improvement via the estimation of a local non-parametric policy; and iii) generalization by fitting a parametric policy. Each step can be implemented in different ways, giving rise to several algorithm variants. Our algorithm draws on connections to existing literature on black-box optimization and 'RL as an inference' and it can be seen either as an extension of the Maximum a Posteriori Policy Optimisation algorithm (MPO) [Abdolmaleki et al., 2018a], or as an extension of Trust Region Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) [Abdolmaleki et…
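
The three steps listed in the abstract can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration, not the paper's implementation: a single state with a one-dimensional action, a fixed quadratic `q_value` standing in for the learned critic of step i, a softmax weighting of sampled actions with a hypothetical temperature `eta` for step ii, and a weighted maximum-likelihood Gaussian fit for step iii.

```python
import numpy as np


def q_value(actions):
    # Step i stand-in (assumption): a fixed quadratic "critic" that peaks at
    # a = 2.0. A real critic would be a learned parametric action-value function.
    return -(actions - 2.0) ** 2


def improvement_weights(q_values, eta):
    # Step ii (assumed form): non-parametric policy over the sampled actions,
    # with weights proportional to exp(Q / eta). Subtracting the maximum
    # keeps the exponentials numerically stable.
    z = (q_values - q_values.max()) / eta
    w = np.exp(z)
    return w / w.sum()


def fit_gaussian(actions, weights):
    # Step iii (assumed form): weighted maximum-likelihood fit of a Gaussian
    # policy to the re-weighted samples.
    mean = np.sum(weights * actions)
    var = np.sum(weights * (actions - mean) ** 2)
    # The small additive floor keeps the standard deviation strictly positive.
    return mean, np.sqrt(var) + 1e-3


def run(iterations=30, n_samples=256, eta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = 0.0, 1.0  # initial Gaussian policy (hypothetical values)
    for _ in range(iterations):
        # Sample actions from the current parametric policy.
        actions = rng.normal(mean, std, size=n_samples)
        weights = improvement_weights(q_value(actions), eta)
        mean, std = fit_gaussian(actions, weights)
    return mean, std
```

Calling `run()` returns the fitted mean and standard deviation. Under these assumptions the mean should drift toward the maximiser of `q_value` at 2.0 while the standard deviation shrinks, and a larger `eta` makes each update more conservative.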

Code & Models

Repositories

acyclics/MPO (PyTorch)

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Control Systems Optimization · Adaptive Dynamic Programming Control