Truly Proximal Policy Optimization

Yuhui Wang; Hao He; Chao Wen; Xiaoyang Tan

arXiv:1903.07940 · cs.LG · January 15, 2020 · 32 cites

TL;DR

Truly PPO introduces a trust region-based clipping mechanism to improve the stability and performance of proximal policy optimization in deep reinforcement learning.

Contribution

It proposes an enhanced PPO algorithm with a new clipping function and trust region-based clipping trigger to ensure monotonic policy improvement.

Findings

01 Improved sample efficiency over original PPO

02 Enhanced stability and performance in RL tasks

03 Guarantees monotonic policy improvement

Abstract

Proximal policy optimization (PPO) is one of the most successful deep reinforcement-learning methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, its optimization behavior is still far from being fully understood. In this paper, we show that PPO can neither strictly restrict the likelihood ratio, as it attempts to do, nor enforce a well-defined trust region constraint, which means that it may still suffer from performance instability. To address this issue, we present an enhanced PPO method, named Truly PPO. Our method makes two critical improvements: 1) it adopts a new clipping function that supports a rollback behavior to restrict the difference between the new policy and the old one; 2) the triggering condition for clipping is replaced with a trust region-based one, such that optimizing the resulting surrogate objective…
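To make the first improvement concrete, here is a minimal NumPy sketch that contrasts the standard PPO clipped surrogate with one possible rollback-style clipping function. The PPO part follows the well-known clipped objective. The rollback part is an assumption for illustration only: the abstract does not give the formula, so the function name `rollback_objective`, the slope parameter `alpha`, and the linear form used outside the clipping range are hypothetical and may differ from the paper's definition. The trust region-based trigger (the second improvement) is not shown.

```python
import numpy as np


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample (or an array of samples).

    ratio     -- likelihood ratio pi_new(a|s) / pi_old(a|s)
    advantage -- advantage estimate for the same (s, a)
    """
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Taking the minimum makes the objective flat once the ratio leaves
    # the range in the direction that would improve the objective.
    return np.minimum(ratio * advantage, clipped * advantage)


def rollback_objective(ratio, advantage, eps=0.2, alpha=0.3):
    """Hypothetical rollback variant (an assumption, not the paper's formula).

    Outside [1 - eps, 1 + eps] the flat segment is replaced by a line with
    slope -alpha, continuous at the boundary, so the objective gets worse
    the further the ratio drifts instead of merely staying constant.
    """
    upper = -alpha * ratio + (1.0 + alpha) * (1.0 + eps)
    lower = -alpha * ratio + (1.0 + alpha) * (1.0 - eps)
    rolled = np.where(
        ratio > 1.0 + eps,
        upper,
        np.where(ratio < 1.0 - eps, lower, ratio),
    )
    return np.minimum(ratio * advantage, rolled * advantage)


if __name__ == "__main__":
    ratios = np.array([1.0, 1.2, 1.5, 2.0])
    adv = 1.0  # positive advantage: the optimizer wants to raise the ratio
    print("ppo     :", ppo_clip_objective(ratios, adv))
    print("rollback:", rollback_objective(ratios, adv))
```

With a positive advantage, the PPO objective stays constant once the ratio exceeds `1 + eps`, so it provides no gradient to pull the ratio back. In the rollback sketch the objective decreases past that point, which gives the optimizer an incentive to return toward the old policy. The value of `alpha` here is arbitrary.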

Code & Models

Repositories

wangyuhuix/TrulyPPO (TensorFlow, official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems

Methods: Entropy Regularization · Proximal Policy Optimization