Generalized Proximal Policy Optimization with Sample Reuse
James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras

TL;DR
This paper introduces a new reinforcement learning algorithm that combines the stability of on-policy methods with the sample efficiency of off-policy methods, supported by theoretical guarantees and empirical results.
Contribution
The paper develops a theoretically grounded off-policy version of PPO, called Generalized Proximal Policy Optimization with Sample Reuse, that balances stability and sample efficiency.
Findings
Improved performance over traditional PPO.
Theoretically supported policy improvement guarantees.
Effective sample reuse in the off-policy setting.
Abstract
In real-world decision making tasks, it is critical for data-driven reinforcement learning methods to be both stable and sample efficient. On-policy methods typically generate reliable policy improvement throughout training, while off-policy methods make more efficient use of data through sample reuse. In this work, we combine the theoretically supported stability benefits of on-policy algorithms with the sample efficiency of off-policy algorithms. We develop policy improvement guarantees that are suitable for the off-policy setting, and connect these bounds to the clipping mechanism used in Proximal Policy Optimization. This motivates an off-policy version of the popular algorithm that we call Generalized Proximal Policy Optimization with Sample Reuse. We demonstrate both theoretically and empirically that our algorithm delivers improved performance by effectively balancing the…
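The abstract's link between off-policy improvement bounds and PPO's clipping mechanism can be illustrated with a minimal sketch. The first function below is the standard PPO clipped surrogate. The second is one plausible generalization for reused samples, in which the clipping interval is centered on the current-to-behavior probability ratio instead of on 1. This summary does not state the paper's exact objective, so the function names, arguments, and default `eps` values here are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def ppo_clip_objective(
    ratios: Sequence[float], advantages: Sequence[float], eps: float = 0.2
) -> float:
    """Standard PPO clipped surrogate, averaged over samples.

    ratios[i] is pi_new(a|s) / pi_current(a|s) for on-policy sample i.
    """
    terms = [
        min(r * adv, clip(r, 1.0 - eps, 1.0 + eps) * adv)
        for r, adv in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)


def generalized_clip_objective(
    new_probs: Sequence[float],
    current_probs: Sequence[float],
    behavior_probs: Sequence[float],
    advantages: Sequence[float],
    eps: float = 0.1,
) -> float:
    """Assumed generalization of the clipped surrogate to reused samples.

    Each sample was collected by an older (behavior) policy. The ratio is
    taken against that behavior policy, and the clipping interval is
    centered on pi_current / pi_behavior (an assumption of this sketch).
    """
    terms = []
    for p_new, p_cur, p_beh, adv in zip(
        new_probs, current_probs, behavior_probs, advantages
    ):
        ratio = p_new / p_beh    # candidate policy vs. data-collecting policy
        center = p_cur / p_beh   # where the current policy sits for this sample
        clipped = clip(ratio, center - eps, center + eps)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)
```

When the behavior policy equals the current policy, the center is 1 and the second function reduces to the first, so the on-policy case appears as a special case of the sketch.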
Taxonomy
Topics: Reinforcement Learning in Robotics · Optimization and Search Problems · Machine Learning and Data Classification
