Relative Entropy Regularized Policy Iteration
Abbas Abdolmaleki, Jost Tobias Springenberg, Jonas Degrave, Steven Bohez, Yuval Tassa, Dan Belov, Nicolas Heess, Martin Riedmiller

TL;DR
This paper introduces an off-policy actor-critic reinforcement learning algorithm that combines stochastic search with a learned action-value function, and reports state-of-the-art results on 31 continuous control tasks.
Contribution
It proposes a flexible three-step policy iteration method that draws on black-box optimization and the 'RL as inference' literature, and that can be read as an extension of either MPO or Trust Region CMA-ES.
Findings
Achieved state-of-the-art results on 31 continuous control tasks.
Demonstrated effectiveness with limited compute and a single hyperparameter set.
Validated the method's versatility across diverse RL environments.
Abstract
We present an off-policy actor-critic algorithm for Reinforcement Learning (RL) that combines ideas from gradient-free optimization via stochastic search with a learned action-value function. The result is a simple procedure consisting of three steps: i) policy evaluation by estimating a parametric action-value function; ii) policy improvement via the estimation of a local non-parametric policy; and iii) generalization by fitting a parametric policy. Each step can be implemented in different ways, giving rise to several algorithm variants. Our algorithm draws on connections to existing literature on black-box optimization and 'RL as inference', and it can be seen either as an extension of the Maximum a Posteriori Policy Optimisation algorithm (MPO) [Abdolmaleki et al., 2018a], or as an extension of Trust Region Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) [Abdolmaleki et…
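The three steps in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: it assumes a single fixed state, a one-dimensional action, a hand-written stand-in for the learned action-value function (`toy_q`, hypothetical), and a hand-picked fixed `temperature`. It also omits any trust-region or KL constraint, which is a simplification made here for brevity. Step i is therefore only stubbed; steps ii and iii are shown as a softmax reweighting of sampled actions followed by a weighted maximum-likelihood Gaussian fit.

```python
import numpy as np


def toy_q(actions):
    # Hypothetical stand-in for step i: a "learned" Q(s, a) at one fixed
    # state. It peaks at a = 2.0, so a good policy should move there.
    return -(actions - 2.0) ** 2


def improve_nonparametric(q_values, temperature):
    # Step ii (sketch): a sample-based, non-parametric improved policy,
    # represented as softmax weights over the sampled actions.
    z = (q_values - q_values.max()) / temperature  # subtract max for stability
    w = np.exp(z)
    return w / w.sum()


def fit_gaussian(actions, weights):
    # Step iii (sketch): generalize by fitting a parametric (Gaussian)
    # policy to the weighted samples via weighted maximum likelihood.
    mean = np.sum(weights * actions)
    var = np.sum(weights * (actions - mean) ** 2)
    return mean, np.sqrt(var) + 1e-3  # small floor keeps sampling non-degenerate


def run(iterations=30, n_samples=64, temperature=1.0, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = 0.0, 1.0  # initial Gaussian policy (assumed)
    for _ in range(iterations):
        actions = rng.normal(mean, std, size=n_samples)  # sample current policy
        weights = improve_nonparametric(toy_q(actions), temperature)
        mean, std = fit_gaussian(actions, weights)
    return mean, std


final_mean, final_std = run()
print(f"policy mean after improvement: {final_mean:.3f}, std: {final_std:.3f}")
```

In this toy setting the policy mean drifts from 0 toward the maximizer of `toy_q` while the standard deviation shrinks. A real variant would replace `toy_q` with a critic trained from off-policy data and condition the policy on the state.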
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Control Systems Optimization · Adaptive Dynamic Programming Control
