Learning to Constrain Policy Optimization with Virtual Trust Region

Hung Le; Thommen Karimpanal George; Majid Abdolshah; Dung Nguyen; Kien Do; Sunil Gupta; Svetha Venkatesh

arXiv:2204.09315·cs.LG·September 19, 2022

TL;DR

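This paper presents MCPO (Memory-Constrained Policy Optimization), a policy gradient reinforcement learning method that adds a second, virtual trust region built from a memory of past policies. The extra constraint is most useful when the single old policy performs poorly. The method is evaluated on diverse environments, including robotic locomotion control and navigation with sparse rewards.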

Contribution

Introducing a dynamic virtual trust region mechanism in policy optimization that leverages past policies for improved reinforcement learning performance.

Findings

01. MCPO outperforms recent on-policy constrained methods in diverse environments.

02. The virtual trust region mechanism adapts effectively during training.

03. Memory-based virtual policies enhance policy stability and learning efficiency.

Abstract

We introduce a constrained optimization method for policy gradient reinforcement learning, which uses a virtual trust region to regulate each policy update. In addition to using the proximity of one single old policy as the normal trust region, we propose forming a second trust region through another virtual policy representing a wide range of past policies. We then enforce the new policy to stay closer to the virtual policy, which is beneficial if the old policy performs poorly. More importantly, we propose a mechanism to automatically build the virtual policy from a memory of past policies, providing a new capability for dynamically learning appropriate virtual trust regions during the optimization process. Our proposed method, dubbed Memory-Constrained Policy Optimization (MCPO), is examined in diverse environments, including robotic locomotion control, navigation with sparse rewards…
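The two-trust-region idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single state with a discrete action space, forms the virtual policy as a fixed convex combination of stored past policies (the paper learns how to build it), and uses KL penalties with hypothetical coefficients `beta_old` and `beta_virtual`. All function names here are illustrative.

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """KL(p || q) for discrete distributions along the last axis."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def virtual_policy(past_probs, weights):
    """Convex combination of past policies' action distributions.

    past_probs: array of shape (K, A), one row per stored past policy.
    weights:    K non-negative scores (assumption: fixed here, whereas
                the paper builds the virtual policy automatically).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the mixture is a valid distribution
    return w @ np.asarray(past_probs, dtype=float)

def penalized_objective(new, old, virtual, advantage, action,
                        beta_old=1.0, beta_virtual=1.0):
    """Importance-weighted surrogate minus two KL penalties.

    The first penalty is the usual trust region around the old policy;
    the second keeps the new policy near the virtual policy.
    The penalty form and coefficients are assumptions for illustration.
    """
    ratio = new[action] / old[action]
    surrogate = ratio * advantage
    return surrogate - beta_old * kl(old, new) - beta_virtual * kl(virtual, new)
```

When the new, old, and virtual policies coincide, both penalties vanish and the objective reduces to the advantage. Moving the new policy away from either reference lowers the objective by the corresponding weighted KL term.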

Videos

Learning to Constrain Policy Optimization with Virtual Trust Region · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Robotic Locomotion and Control · Human Pose and Action Recognition