Preventing Learning Stagnation in PPO by Scaling to 1 Million Parallel Environments

Michael Beukman; Khimya Khetarpal; Zeyu Zheng; Will Dabney; Jakob Foerster; Michael Dennis; Clare Lyle

arXiv:2603.06009 · cs.LG · March 9, 2026

Open Access

TL;DR

This paper attributes performance plateaus in PPO, in certain regimes, to sample-based loss estimates that become poor proxies for the true objective as training progresses. It shows that scaling to over 1 million parallel environments prevents this stagnation and improves performance.

Contribution

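The authors model PPO's outer loop as stochastic optimization, in which the regularization strength towards the previous policy controls the step size and the number of samples controls the gradient noise. They use this view to explain why adding parallel environments helps, and they scale PPO to over 1 million environments.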

Findings

1. Scaling to 1 million environments prevents performance stagnation.
2. Increasing parallel environments reduces sample estimate noise.
3. Proper hyperparameter co-scaling is crucial for performance.
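
The second finding can be illustrated with a toy sketch. This is not the paper's experiment: it assumes a scalar "gradient" whose per-environment samples are Gaussian around a true value, and the names `gradient_estimate`, `estimate_spread`, `true_grad`, and `noise_std` are hypothetical. Under those assumptions, averaging over more environments shrinks the spread of the estimate roughly as one over the square root of the number of environments.

```python
import random
import statistics


def gradient_estimate(num_envs, true_grad=1.0, noise_std=5.0, rng=random):
    """Average num_envs noisy samples of a toy scalar gradient (an assumption)."""
    samples = [rng.gauss(true_grad, noise_std) for _ in range(num_envs)]
    return sum(samples) / num_envs


def estimate_spread(num_envs, trials=200, seed=0):
    """Empirical standard deviation of the averaged estimate over repeated trials."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    estimates = [gradient_estimate(num_envs, rng=rng) for _ in range(trials)]
    return statistics.stdev(estimates)


if __name__ == "__main__":
    # More environments per estimate should give a tighter estimate.
    for n in (16, 256, 4096):
        print(n, round(estimate_spread(n), 3))
```

The real PPO loss is not a simple Gaussian average, so this only conveys the direction of the effect, not its size.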

Abstract

Plateaus, where an agent's performance stagnates at a suboptimal level, are a common problem in deep on-policy RL. Focusing on PPO due to its widespread adoption, we show that plateaus in certain regimes arise not because of known exploration, capacity, or optimization challenges, but because sample-based estimates of the loss eventually become poor proxies for the true objective over the course of training. As a recap, PPO switches between sampling rollouts from several parallel environments online using the current policy (which we call the outer loop) and performing repeated minibatch SGD steps against this offline dataset (the inner loop). In our work we consider only the outer loop, and conceptually model it as stochastic optimization. The step size is then controlled by the regularization strength towards the previous policy and the gradient noise by the number of samples…
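
The outer and inner loops described in the abstract can be made concrete with some bookkeeping. The sketch below is an assumption-laden illustration, not the authors' configuration: `PPOSchedule` and its fields (`num_envs`, `rollout_len`, `num_epochs`, `num_minibatches`) are hypothetical names modeled on common PPO implementations, and the numbers in the usage note are chosen only for easy arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOSchedule:
    """Sizes for one outer-loop iteration of a generic PPO setup (hypothetical)."""

    num_envs: int         # parallel environments sampled in the outer loop
    rollout_len: int      # steps collected per environment per outer iteration
    num_epochs: int       # passes over the collected batch in the inner loop
    num_minibatches: int  # minibatches per pass

    @property
    def batch_size(self):
        # Transitions gathered by one outer-loop rollout.
        return self.num_envs * self.rollout_len

    @property
    def minibatch_size(self):
        # Transitions per inner-loop SGD step (integer division).
        return self.batch_size // self.num_minibatches

    @property
    def inner_steps(self):
        # Minibatch SGD steps taken before the next rollout.
        return self.num_epochs * self.num_minibatches
```

For example, `PPOSchedule(num_envs=1_000_000, rollout_len=16, num_epochs=4, num_minibatches=32)` gives a batch of 16,000,000 transitions, minibatches of 500,000, and 128 inner steps per outer iteration. Raising `num_envs` enlarges the batch behind each outer-loop step, which is the quantity the abstract ties to gradient noise.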

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)