Transductive Off-policy Proximal Policy Optimization

Yaozhong Gan; Renye Yan; Xiaoyang Tan; Zhe Wu; Junliang Xing

arXiv:2406.03894 · cs.LG · June 7, 2024

TL;DR

This paper introduces Transductive Off-policy PPO (ToPPO), an off-policy extension of PPO that leverages data from different policies, supported by theoretical guarantees and validated through experiments on six tasks.

Contribution

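The paper presents a novel off-policy extension of PPO. It derives a new policy improvement lower bound for prospective policies learned from off-policy data, and pairs it with a computationally efficient method for optimizing that bound, with the aim of using data from other policies safely and more efficiently.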

Findings

01. ToPPO outperforms standard PPO on six tasks.

02. Theoretical guarantees ensure safe off-policy data use.

03. Efficient optimization maintains monotonic policy improvement.

Abstract

Proximal Policy Optimization (PPO) is a popular model-free reinforcement learning algorithm, esteemed for its simplicity and efficacy. However, due to its inherent on-policy nature, its proficiency in harnessing data from disparate policies is constrained. This paper introduces a novel off-policy extension to the original PPO method, christened Transductive Off-policy PPO (ToPPO). Herein, we provide theoretical justification for incorporating off-policy data in PPO training and prudent guidelines for its safe application. Our contribution includes a novel formulation of the policy improvement lower bound for prospective policies derived from off-policy data, accompanied by a computationally efficient mechanism to optimize this bound, underpinned by assurances of monotonic improvement. Comprehensive experimental results across six representative tasks underscore ToPPO's promising…
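For context on the on-policy constraint the abstract refers to, the following is a minimal sketch of the standard PPO clipped surrogate objective, not the ToPPO bound itself, which the abstract does not spell out. The function name `ppo_clip_surrogate`, its argument names, and the default `clip_eps=0.2` are illustrative assumptions. The importance ratio compares the candidate policy with the policy that collected the data, which is why standard PPO expects that data to come from the most recent policy.

```python
import numpy as np

def ppo_clip_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate (to be maximized).

    Hypothetical helper for illustration only; this is NOT ToPPO's objective.
    logp_new:   log-probabilities of the taken actions under the candidate policy
    logp_old:   log-probabilities of the same actions under the data-collecting policy
    advantages: advantage estimates for those state-action pairs
    """
    # Importance ratio between the candidate and data-collecting policies.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping removes the incentive to move the ratio outside [1 - eps, 1 + eps].
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum gives a pessimistic estimate of the improvement.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

When the two policies agree on every sampled action, the ratio is 1 and the surrogate reduces to the mean advantage. As the data-collecting policy drifts further from the candidate policy, more ratios fall outside the clip range, which is one way to see why reusing data from disparate policies needs extra justification.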

Taxonomy

Topics: Optimization and Search Problems · Age of Information Optimization

Methods: Entropy Regularization · Proximal Policy Optimization