V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control

H. Francis Song; Abbas Abdolmaleki; Jost Tobias Springenberg; Aidan Clark; Hubert Soyer; Jack W. Rae; Seb Noury; Arun Ahuja; Siqi Liu; Dhruva Tirumala; Nicolas Heess; Dan Belov; Martin Riedmiller; Matthew M. Botvinick

arXiv:1909.12238·cs.AI·September 27, 2019·39 cites

TL;DR

V-MPO is an on-policy reinforcement learning algorithm, adapted from MPO, that performs policy iteration based on a learned state-value function. It surpasses previously reported scores on the Atari-57 and DMLab-30 benchmark suites in the multi-task setting.

Contribution

Introduces V-MPO, an on-policy adaptation of MPO that works reliably without importance weighting, entropy regularization, or population-based tuning of hyperparameters, and that surpasses previously reported scores on Atari-57 and DMLab-30.

Findings

01. V-MPO surpasses previously reported scores on the Atari-57 and DMLab-30 benchmark suites in the multi-task setting.

02. V-MPO achieves higher scores than previously reported on individual DMLab and Atari levels.

03. V-MPO learns to control high-dimensional simulated humanoids and solves OpenAI Gym continuous control tasks.

Abstract

Some of the most successful applications of deep reinforcement learning to challenging domains in discrete and continuous control have used policy gradient methods in the on-policy setting. However, policy gradients can suffer from large variance that may limit performance, and in practice require carefully tuned entropy regularization to prevent policy collapse. As an alternative to policy gradient algorithms, we introduce V-MPO, an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) that performs policy iteration based on a learned state-value function. We show that V-MPO surpasses previously reported scores for both the Atari-57 and DMLab-30 benchmark suites in the multi-task setting, and does so reliably without importance weighting, entropy regularization, or population-based tuning of hyperparameters. On individual DMLab and Atari levels, the proposed algorithm…
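The abstract does not spell out the update itself, so the following is only a rough sketch of the kind of policy-improvement step an MPO-style method built on a state-value function might use. It assumes that advantages have already been estimated from the learned state-value function, that only the top half of samples by advantage are kept, and that those samples are weighted by a softmax of the advantage at a fixed temperature. The function names `vmpo_sample_weights` and `weighted_policy_loss`, the example numbers, and the fixed temperature are all hypothetical; learning the temperature and the KL constraint on the policy are omitted.

```python
import math


def vmpo_sample_weights(advantages, temperature):
    """Softmax weights over the top half of samples by advantage (assumed rule).

    Samples outside the top half receive weight zero.
    """
    n = len(advantages)
    k = max(1, n // 2)
    # Indices of the k samples with the largest advantages.
    top = sorted(range(n), key=lambda i: advantages[i], reverse=True)[:k]
    # Subtract the maximum before exponentiating for numerical stability.
    a_max = max(advantages[i] for i in top)
    exps = {i: math.exp((advantages[i] - a_max) / temperature) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(n)]


def weighted_policy_loss(weights, log_probs):
    """Negative weighted log-likelihood of the actions that were taken."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Hypothetical batch: advantages from a value function, log pi(a|s) per sample.
advantages = [1.0, -0.5, 0.2, -2.0]
log_probs = [-1.2, -0.7, -0.9, -1.5]

weights = vmpo_sample_weights(advantages, temperature=1.0)
loss = weighted_policy_loss(weights, log_probs)
print(weights, loss)
```

Minimizing this loss raises the log-probability of high-advantage actions in proportion to their weights, which is a weighted maximum-likelihood fit rather than a policy-gradient step; that is one way to read the "policy iteration" framing in the abstract.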

Code & Models

Repositories

YYCAAA/V-MPO_Lunarlander (PyTorch)

Taxonomy

Topics: Reinforcement Learning in Robotics