Preventing Learning Stagnation in PPO by Scaling to 1 Million Parallel Environments
Michael Beukman, Khimya Khetarpal, Zeyu Zheng, Will Dabney, Jakob Foerster, Michael Dennis, Clare Lyle

TL;DR
This paper attributes performance plateaus in PPO, in certain regimes, to sample-based loss estimates that become poor proxies for the true objective during training. It demonstrates that scaling to over 1 million parallel environments prevents this stagnation and yields significant performance gains.
Contribution
The authors model PPO's outer loop as stochastic optimization, in which the regularization strength towards the previous policy controls the step size and the number of samples controls the gradient noise. They then scale PPO to over 1 million parallel environments for improved performance.
Findings
Scaling to 1 million environments prevents performance stagnation.
Increasing parallel environments reduces sample estimate noise.
Proper hyperparameter co-scaling is crucial for performance.
Abstract
Plateaus, where an agent's performance stagnates at a suboptimal level, are a common problem in deep on-policy RL. Focusing on PPO due to its widespread adoption, we show that plateaus in certain regimes arise not because of known exploration, capacity, or optimization challenges, but because sample-based estimates of the loss eventually become poor proxies for the true objective over the course of training. As a recap, PPO switches between sampling rollouts from several parallel environments online using the current policy (which we call the outer loop) and performing repeated minibatch SGD steps against this offline dataset (the inner loop). In our work we consider only the outer loop, and conceptually model it as stochastic optimization. The step size is then controlled by the regularization strength towards the previous policy and the gradient noise by the number of samples…
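The link between sample count and gradient noise described above can be illustrated with a toy experiment. The sketch below is not the paper's method or code. It assumes a hypothetical scalar "true gradient" of 1.0 and unit-variance Gaussian noise per environment, and the function name `estimate_std_of_mean` is invented for illustration. It shows only the general statistical fact that the noise of an N-sample average shrinks roughly as 1/sqrt(N), which is the sense in which more parallel environments give a less noisy update.

```python
import random
import statistics


def estimate_std_of_mean(num_envs, trials=200, seed=0):
    """Empirical standard deviation of an average over `num_envs` noisy samples.

    Each sample stands in for one environment's gradient estimate:
    a hypothetical true value plus unit-variance Gaussian noise (an assumption).
    """
    rng = random.Random(seed)
    true_gradient = 1.0  # hypothetical scalar "true" gradient
    estimates = []
    for _ in range(trials):
        samples = [true_gradient + rng.gauss(0.0, 1.0) for _ in range(num_envs)]
        estimates.append(sum(samples) / num_envs)
    # Spread of the averaged estimate across independent trials.
    return statistics.pstdev(estimates)


if __name__ == "__main__":
    # Noise should fall by roughly 4x for every 16x increase in samples.
    for n in (1, 16, 256):
        print(n, round(estimate_std_of_mean(n), 3))
```

Under these assumptions, quadrupling the noise reduction requires sixteen times as many samples, which gives some intuition for why very large environment counts might be needed before the sample-based loss tracks the true objective closely.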
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
