Beyond the Boundaries of Proximal Policy Optimization

Charlie B. Tan; Edan Toledo; Benjamin Ellis; Jakob N. Foerster; Ferenc Huszár

arXiv:2411.00666·cs.LG·November 4, 2024

Open Access · 3 Reviews

TL;DR

This paper reinterprets PPO by decomposing it into separate estimation and application steps, introduces outer-PPO allowing flexible optimizers, and empirically demonstrates improved performance with non-unity learning rates and momentum.

Contribution

It proposes outer-PPO, a new framework that decouples update estimation from application, enabling the use of arbitrary optimizers and challenging PPO's implicit design choices.

Findings

01

Non-unity learning rates improve performance on Brax and Jumanji environments.

02

Momentum applied to the outer loop yields statistically significant gains.

03

Decoupling update estimation and application reveals new optimization strategies for PPO.

Abstract

Proximal policy optimization (PPO) is a widely used algorithm for on-policy reinforcement learning. This work offers an alternative perspective on PPO, in which it is decomposed into the inner-loop estimation of update vectors and the outer-loop application of updates using gradient ascent with unity learning rate. Using this insight we propose outer proximal policy optimization (outer-PPO), a framework wherein these update vectors are applied using an arbitrary gradient-based optimizer. The decoupling of update estimation and update application enabled by outer-PPO highlights several implicit design choices in PPO that we challenge through empirical investigation. In particular, we consider non-unity learning rates and momentum applied to the outer loop, and a momentum-bias applied to the inner estimation loop. Methods are evaluated against an aggressively tuned PPO baseline on Brax,…
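
The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `outer_ppo_step`, `toy_inner_loop`, `outer_lr`, and `momentum` are hypothetical names, the inner loop is a toy stand-in rather than PPO's actual surrogate optimization, and plain heavy-ball momentum is assumed because this summary does not state which momentum variant the paper uses. The sketch shows that the inner loop proposes new parameters, the difference from the current parameters is treated as an update vector, and the outer loop applies that vector by gradient ascent. With `outer_lr=1.0` and `momentum=0.0` the step reduces to accepting the inner-loop result, which is the standard PPO behaviour the abstract describes.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def outer_ppo_step(
    params: Vector,
    velocity: Vector,
    inner_loop: Callable[[Vector], Vector],
    outer_lr: float = 1.0,
    momentum: float = 0.0,
) -> Tuple[Vector, Vector]:
    """One outer iteration: estimate an update vector, then apply it.

    Assumption: heavy-ball momentum; the paper's exact variant may differ.
    """
    # Inner loop: where standard PPO would move the parameters.
    proposed = inner_loop(params)
    # Update vector: displacement proposed by the inner loop.
    update = [p_new - p for p_new, p in zip(proposed, params)]
    # Accumulate momentum over outer iterations (no-op when momentum == 0).
    velocity = [momentum * v + u for v, u in zip(velocity, update)]
    # Outer loop: gradient ascent along the (momentum-smoothed) update vector.
    new_params = [p + outer_lr * v for p, v in zip(params, velocity)]
    return new_params, velocity


def toy_inner_loop(params: Vector) -> Vector:
    """Hypothetical stand-in for PPO's inner epochs: move halfway toward 1.0."""
    return [p + 0.5 * (1.0 - p) for p in params]


if __name__ == "__main__":
    theta, vel = [0.0, 2.0], [0.0, 0.0]
    # Unity learning rate, no momentum: identical to taking the inner-loop result.
    print(outer_ppo_step(theta, vel, toy_inner_loop))
    # Non-unity outer learning rate: overshoots the inner-loop proposal.
    print(outer_ppo_step(theta, vel, toy_inner_loop, outer_lr=2.0))
```

Keeping the inner loop behind a plain callable is a deliberate choice in this sketch: it makes explicit that the outer optimizer only ever sees the update vector, so swapping in a different outer rule does not touch the estimation step.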

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 2

Strengths

The algorithms are clearly introduced with the help of figures. The paper's claims about empirical performance are well supported by the empirical results.

Weaknesses

The novelty of the algorithm could be explained further. Empirically, outer-PPO with a non-unity learning rate performs best on Brax and Jumanji and serves as the main contribution. However, what is the difference between the proposed algorithm and rescaling the learning rate of the original PPO? Do the empirical results suggest that stochastic gradient descent can be a better optimizer than the commonly used Adam?

Reviewer 02 · Rating 6 · Confidence 4

Strengths

Originality: The idea presented in the paper is, as far as I know, novel.

Quality and Clarity: The paper presents the idea in a clear way, highlighting the main research questions well and providing a robust empirical analysis of the proposed algorithm (with very detailed ablation studies). The algorithm is sound and coherent with the investigation objective.

Significance: I am unsure about the significance of the proposed idea. The pap

Weaknesses

As I have mentioned above, I think a weakness of the proposed method is the introduction of new hyperparameters (i.e., outer learning rate and momentum) - with what seems to be little payoff. Furthermore, I am unsure whether the learning rate is necessary: the hyperparameter $\epsilon$ of PPO already provides a mechanism to control "how aggressive" the policy updates are. While I can acknowledge that the learning rate and the $\epsilon$ are two different terms (i.e., the learning rate $\sigma$

Reviewer 03 · Rating 3 · Confidence 5

Strengths

- The paper is clear.
- They do a lot of experiments, and the results are reported faithfully.

Weaknesses

- The performance improvements are at best small (5-10%).
- The performance change could also be due to changes in hyperparameter tuning procedures. Currently, the authors tune PPO by optimizing 11 hyperparameters for 600 trials (each trial averaged across 4 seeds) using a Tree Structured Parzen method, and then doing a final evaluation at the best parameters for 64 seeds (different from the initial 4 seeds). And the outer-PPO methods are hyperparameter tuned by first doing 500 trials of PPO hype

Taxonomy

Topics: Economic Policies and Impacts

Methods: Entropy Regularization · Proximal Policy Optimization