Batch size-invariance for policy optimization

Jacob Hilton; Karl Cobbe; John Schulman

arXiv:2110.00641 · cs.LG · March 28, 2023


TL;DR

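This paper shows how to make policy optimization algorithms such as PPO batch size-invariant. It does this by decoupling the proximal policy, which controls the size of policy updates, from the behavior policy, which is used for off-policy corrections. The decoupling also lets these algorithms make more efficient use of stale data.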
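Batch size-invariance means that a change to the batch size can largely be compensated for by changes to other hyperparameters. The helper below is a hypothetical illustration of such a compensation. It assumes the proximal policy is kept as an exponentially weighted moving average (EWMA) of the policy weights, as the name of the openai/ppo-ewma repository suggests.

It also makes two further assumptions, which are not taken from the paper's code:

- The learning rate scales linearly with the batch size, which is the familiar small-batch rule for SGD.
- The EWMA's center of mass, measured in optimizer steps, is rescaled so that the proximal policy's age measured in samples stays fixed.

The function name and parameters are invented for this sketch.

```python
def rescale_for_batch_size(lr, beta_prox, c):
    """Hypothetical helper: adjust hyperparameters when the batch size is
    multiplied by c (c < 1 shrinks the batch, c > 1 grows it).

    Assumptions (illustrative, not the paper's exact recipe):
      * learning rate scales linearly with batch size;
      * the proximal policy is an EWMA of policy weights with per-step decay
        beta_prox, whose center of mass beta / (1 - beta) in optimizer steps
        is divided by c, so its age measured in samples is unchanged.
    """
    if c <= 0:
        raise ValueError("c must be positive")
    if not 0.0 <= beta_prox < 1.0:
        raise ValueError("beta_prox must be in [0, 1)")
    new_lr = lr * c
    com_steps = beta_prox / (1.0 - beta_prox)  # EWMA center of mass, in steps
    new_com_steps = com_steps / c              # larger batches -> fewer steps
    new_beta = new_com_steps / (1.0 + new_com_steps)  # invert com = b / (1 - b)
    return new_lr, new_beta
```

For example, halving the batch size (c = 0.5) with beta_prox = 0.5 doubles the center of mass from 1 step to 2 steps, giving beta_prox = 2/3. The number of samples the proximal policy "looks back" over therefore stays the same.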

Contribution

The authors achieve batch size-invariance in policy optimization algorithms by decoupling the proximal policy from the behavior policy, two roles that standard PPO assigns to the same policy.
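To make the decoupling concrete, a PPO-style clipped surrogate can be written with the two roles separated. This is a sketch of how we read the idea, not a verbatim copy of the paper's equations.

- The ratio of the proximal to the behavior policy performs the off-policy correction.
- The clipped ratio r_t(θ) limits how far the update moves away from the proximal policy.

Setting π_prox = π_behav recovers the standard PPO clipped objective.

```latex
L^{\text{decoupled}}(\theta) = \hat{\mathbb{E}}_t\!\left[
  \frac{\pi_{\text{prox}}(a_t \mid s_t)}{\pi_{\text{behav}}(a_t \mid s_t)}
  \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{prox}}(a_t \mid s_t)}
```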

Findings

01

PPO-style algorithms, which are not batch size-invariant as usually implemented, can be made batch size-invariant with the proposed method.

02

Decoupling the proximal and behavior policies lets these algorithms make more efficient use of stale data.

03

Experiments help explain why these algorithms work and support the effectiveness of the approach.

Abstract

We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this work we show how to make these algorithms batch size-invariant. Our key insight is to decouple the proximal policy (used for controlling policy updates) from the behavior policy (used for off-policy corrections). Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data.
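The sketch below shows one way to compute a per-batch loss in which the proximal and behavior policies play separate roles. It is a minimal plain-Python version, not code from openai/ppo-ewma, and the function and argument names are hypothetical. It takes per-sample log-probabilities of the taken actions under three policies:

- the current policy (logp_new);
- the proximal policy (logp_prox);
- the behavior policy that collected the data (logp_behav).

It also takes advantage estimates (adv) and returns the negated clipped surrogate for a minimizer.

```python
import math


def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, adv, eps=0.2):
    """Negative mean of a decoupled PPO-style clipped surrogate.

    Hypothetical sketch: pi_prox / pi_behav corrects for off-policy data,
    while the clipped ratio pi_theta / pi_prox limits the size of the update.
    """
    n = len(adv)
    if n == 0:
        raise ValueError("need at least one sample")
    if not (len(logp_new) == len(logp_prox) == len(logp_behav) == n):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for ln, lp, lb, a in zip(logp_new, logp_prox, logp_behav, adv):
        w = math.exp(lp - lb)  # off-policy weight: pi_prox / pi_behav
        r = math.exp(ln - lp)  # update-size ratio: pi_theta / pi_prox
        r_clip = min(max(r, 1.0 - eps), 1.0 + eps)
        total += w * min(r * a, r_clip * a)  # pessimistic clipped term
    return -total / n
```

When logp_prox equals logp_behav, the weight is 1 and the function reduces to the usual PPO clipped loss. In an autodiff framework, only logp_new would carry gradients; the proximal and behavior log-probabilities are treated as constants.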


Code & Models

Repositories

openai/ppo-ewma
PyTorch · Official

Videos

Batch size-invariance for policy optimization · SlidesLive

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms