Transductive Off-policy Proximal Policy Optimization
Yaozhong Gan, Renye Yan, Xiaoyang Tan, Zhe Wu, Junliang Xing

TL;DR
This paper introduces Transductive Off-policy PPO (ToPPO), an off-policy extension of PPO that leverages data from different policies, supported by theoretical guarantees and validated through experiments on six tasks.
Contribution
It presents a novel off-policy extension of PPO, with a new policy improvement lower bound and a computationally efficient method for optimizing it, aimed at improving data efficiency and performance.
Findings
ToPPO outperforms standard PPO on six tasks.
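The paper gives theoretical justification and guidelines for using off-policy data safely in PPO training.
The lower bound is optimized with a computationally efficient mechanism that carries a monotonic improvement guarantee.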
Abstract
Proximal Policy Optimization (PPO) is a popular model-free reinforcement learning algorithm, esteemed for its simplicity and efficacy. However, due to its inherent on-policy nature, its proficiency in harnessing data from disparate policies is constrained. This paper introduces a novel off-policy extension to the original PPO method, christened Transductive Off-policy PPO (ToPPO). Herein, we provide theoretical justification for incorporating off-policy data in PPO training and prudent guidelines for its safe application. Our contribution includes a novel formulation of the policy improvement lower bound for prospective policies derived from off-policy data, accompanied by a computationally efficient mechanism to optimize this bound, underpinned by assurances of monotonic improvement. Comprehensive experimental results across six representative tasks underscore ToPPO's promising…
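For readers unfamiliar with the baseline, the sketch below shows the standard PPO clipped surrogate objective, not ToPPO's bound, which this summary does not reproduce. It only illustrates where the data-collecting policy enters the computation: the importance ratio divides the new policy's probability by that of whichever policy produced the sample. The function name, argument names, and the default clip_eps of 0.2 are assumptions made for illustration.

```python
import math

def clipped_surrogate(logp_new, logp_behavior, advantages, clip_eps=0.2):
    """Mean PPO-style clipped surrogate over a batch of samples.

    logp_new:      log-probability of each action under the policy being optimized
    logp_behavior: log-probability under the policy that collected the sample
    advantages:    advantage estimate for each sample
    clip_eps:      clipping range (illustrative default, not taken from the paper)
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_behavior, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance weight
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

In on-policy PPO the behavior policy is the previous iterate, so the ratio stays near 1. When samples come from older or different policies the ratio can drift far from 1, which is the situation the paper's theoretical analysis is meant to address.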
Taxonomy
Topics: Optimization and Search Problems · Age of Information Optimization
Methods: Entropy Regularization · Proximal Policy Optimization
