Truly Proximal Policy Optimization
Yuhui Wang, Hao He, Chao Wen, Xiaoyang Tan

TL;DR
Truly PPO introduces a trust region-based clipping mechanism to improve the stability and performance of proximal policy optimization in deep reinforcement learning.
Contribution
The paper proposes an enhanced PPO algorithm that combines a new clipping function with a trust region-based clipping trigger, with the aim of guaranteeing monotonic policy improvement.
Findings
Improved sample efficiency over original PPO
Enhanced stability and performance in RL tasks
Guarantees monotonic policy improvement
Abstract
Proximal policy optimization (PPO) is one of the most successful deep reinforcement-learning methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, its optimization behavior is still far from being fully understood. In this paper, we show that PPO could neither strictly restrict the likelihood ratio as it attempts to do nor enforce a well-defined trust region constraint, which means that it may still suffer from the risk of performance instability. To address this issue, we present an enhanced PPO method, named Truly PPO. Two critical improvements are made in our method: 1) it adopts a new clipping function to support a rollback behavior to restrict the difference between the new policy and the old one; 2) the triggering condition for clipping is replaced with a trust region-based one, such that optimizing the resulted surrogate objective…
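As a rough illustration of the two ideas the abstract names, the sketch below contrasts standard PPO ratio clipping with a rollback-style clipping function and adds a KL-based clipping trigger. The abstract does not give the formulas, so the function names, the slope parameter alpha, the threshold delta, and the piecewise-linear form used here are assumptions chosen for illustration, not necessarily the authors' exact definitions.

```python
import math


def ppo_clip(ratio, eps=0.2):
    """Standard PPO clipping: the ratio is held flat outside [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(ratio, 1.0 + eps))


def rollback_clip(ratio, eps=0.2, alpha=0.3):
    """Rollback-style clipping (illustrative assumption).

    Outside the clip range the function has slope -alpha instead of being
    flat, and it stays continuous at the range boundaries.
    """
    if ratio <= 1.0 - eps:
        return -alpha * ratio + (1.0 + alpha) * (1.0 - eps)
    if ratio >= 1.0 + eps:
        return -alpha * ratio + (1.0 + alpha) * (1.0 + eps)
    return ratio


def surrogate(ratio, advantage, clip_fn):
    """Pessimistic per-sample surrogate: min(r * A, clip_fn(r) * A)."""
    return min(ratio * advantage, clip_fn(ratio) * advantage)


def kl_divergence(p_old, p_new):
    """KL(p_old || p_new) for two discrete action distributions."""
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0.0)


def trust_region_triggered(p_old, p_new, delta=0.05):
    """Hypothetical trigger: clip once the KL divergence reaches delta."""
    return kl_divergence(p_old, p_new) >= delta
```

With a positive advantage of 1.0 and a ratio of 1.5, `surrogate(1.5, 1.0, ppo_clip)` sits on the flat part of the clipped objective at 1.2, so its gradient with respect to the ratio is zero and nothing pulls the policy back. Under the same inputs, `surrogate(1.5, 1.0, rollback_clip)` is about 1.11 and keeps decreasing as the ratio grows, which is the kind of restoring signal a rollback behavior is meant to provide.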
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems
Methods: Entropy Regularization · Proximal Policy Optimization
