The Sufficiency of Off-Policyness and Soft Clipping: PPO is still Insufficient according to an Off-Policy Measure

Xing Chen; Dongcui Diao; Hechang Chen; Hengshuai Yao; Haiyin Piao; Zhixiao Sun; Zhiwei Yang; Randy Goebel; Bei Jiang; Yi Chang

arXiv:2205.10047·cs.LG·December 6, 2022

TL;DR

This paper demonstrates that PPO's clipped policy space is insufficient by introducing a novel off-policy measure, showing that better policies exist outside PPO's constrained space, and proposing an exploration method that surpasses PPO in policy optimization.

Contribution

The paper introduces a new surrogate objective using sigmoid functions, revealing PPO's limitations and enabling exploration beyond the clipped policy space, improving CPI optimization.

Findings

01

PPO is insufficient in off-policyness according to the DEON metric.

02

The proposed method explores a larger policy space than PPO.

03

The proposed algorithm outperforms PPO in maximizing the CPI objective during training.

Abstract

The popular Proximal Policy Optimization (PPO) algorithm approximates the solution in a clipped policy space. Do better policies exist outside this space? By using a novel surrogate objective that employs the sigmoid function (which provides an interesting way of exploring), we found that the answer is "YES", and that the better policies are in fact located very far from the clipped space. We show that PPO is insufficient in "off-policyness", according to an off-policy metric called DEON. Our algorithm explores a much larger policy space than PPO, and it maximizes the Conservative Policy Iteration (CPI) objective better than PPO during training. To the best of our knowledge, all current PPO methods use the clipping operation and optimize in the clipped policy space. Our method is the first of this kind, which advances the understanding of CPI optimization and policy…
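To make the contrast in the abstract concrete, the sketch below compares per-sample surrogate terms as a function of the probability ratio r = π_new(a|s) / π_old(a|s). The CPI term and the PPO clipped term follow their standard definitions. The sigmoid-based term is only an illustrative assumption: the abstract says the surrogate "employs the sigmoid function" but does not give its form, so the function `soft_clip_term`, the temperature `tau`, and the `4 / tau` scaling (chosen so the slope matches CPI at r = 1) are hypothetical and may differ from the paper's objective.

```python
import math


def cpi_term(ratio, adv):
    """Unclipped CPI surrogate for one sample: ratio * advantage."""
    return ratio * adv


def ppo_clip_term(ratio, adv, eps=0.2):
    """Standard PPO clipped surrogate for one sample."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped * adv)


def soft_clip_term(ratio, adv, tau=2.0):
    """Hypothetical sigmoid-based soft clip (an assumption, not the paper's form).

    The 4 / tau factor makes the slope with respect to the ratio equal
    to the CPI slope (adv) at ratio = 1.
    """
    sig = 1.0 / (1.0 + math.exp(-tau * (ratio - 1.0)))
    return (4.0 / tau) * sig * adv


def slope(term, ratio, adv, h=1e-6):
    """Central finite-difference slope of a surrogate term in the ratio."""
    return (term(ratio + h, adv) - term(ratio - h, adv)) / (2.0 * h)


if __name__ == "__main__":
    adv = 1.0  # positive advantage: the update wants to raise the ratio
    for r in (1.0, 1.5, 3.0):
        print(
            f"ratio={r:.1f}  "
            f"ppo_slope={slope(ppo_clip_term, r, adv):.4f}  "
            f"soft_slope={slope(soft_clip_term, r, adv):.4f}"
        )
```

With a positive advantage, the PPO term is flat once the ratio exceeds 1 + eps, so those samples contribute no gradient and the policy is not pushed further from the old one. A smooth sigmoid-shaped term keeps a small but nonzero slope at large ratios, which is one way a method could continue to move into policy space that hard clipping rules out.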

Code & Models

Repositories

raincchio/p3o · tf · Official

Videos

The Sufficiency of Off-Policyness and Soft Clipping: PPO is still Insufficient according to an Off-Policy Measure · underline

Taxonomy

Topics: Optimization and Search Problems · Age of Information Optimization

Methods: Entropy Regularization · Proximal Policy Optimization