ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm
Hanyong Wang, Menglong Yang

TL;DR
ExO-PPO is a novel reinforcement learning algorithm that combines the stability of on-policy methods with the sample efficiency of off-policy techniques, leading to improved performance across various tasks.
Contribution
This paper introduces ExO-PPO, a new PPO variant that leverages off-policy data using a segmented exponential clipping mechanism and a replay buffer for enhanced efficiency and stability.
Findings
ExO-PPO outperforms PPO and other variants in empirical tests.
ExO-PPO achieves better sample efficiency and stability.
The method demonstrates versatility across different tasks.
Abstract
Deep reinforcement learning has solved a variety of tasks successfully. However, due to the construction of the policy gradient and its training dynamics, tuning deep reinforcement learning models remains challenging. As one of the most successful deep reinforcement learning algorithms, Proximal Policy Optimization (PPO) clips the policy gradient within conservative on-policy updates, which ensures reliable and stable policy improvement. However, this training pattern may sacrifice sample efficiency. Off-policy methods, on the other hand, make fuller use of data through sample reuse, though at the cost of increased estimation variance and bias. To leverage the advantages of both, in this paper we propose a new PPO variant that builds on the stability guarantee of conservative on-policy iteration while using off-policy data more efficiently. Specifically, we…
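For reference, the conservative clipping that the abstract attributes to standard PPO can be sketched as below. This is a minimal illustration of the widely used PPO clipped surrogate objective, not ExO-PPO's segmented exponential clipping, whose details are not given in this summary. The function name `ppo_clip_surrogate`, the default `eps=0.2`, and the sample values are assumptions chosen for illustration.

```python
def ppo_clip_surrogate(ratios, advantages, eps=0.2):
    """Mean clipped surrogate objective of standard PPO (to be maximized).

    ratios:     probability ratios pi_new(a|s) / pi_old(a|s), one per sample
    advantages: advantage estimates, one per sample
    eps:        clipping range (0.2 is a common default, assumed here)
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped_r = max(1.0 - eps, min(r, 1.0 + eps))
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


# Hypothetical sample values: one ratio above and one below the clip range.
objective = ppo_clip_surrogate([1.5, 0.5], [1.0, -1.0])
print(objective)  # approximately 0.2
```

With a positive advantage the ratio 1.5 is capped at 1.2, and with a negative advantage the ratio 0.5 is raised to 0.8, so neither sample rewards moving the policy further from the data-collecting policy. This is the conservative behavior that, according to the abstract, buys stability at the expense of sample efficiency.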
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control
