Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

TL;DR
Proximal Policy Optimization (PPO) is a simple, effective policy gradient method for reinforcement learning that improves sample efficiency and performance on benchmark tasks in simulated robotic locomotion and Atari game playing.
Contribution
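PPO is a new policy gradient algorithm that allows multiple epochs of minibatch updates on each batch of sampled data, rather than one gradient update per data sample. It keeps some of the benefits of trust region policy optimization (TRPO) while being simpler to implement, more general, and empirically more sample-efficient.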
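For reference, the summary here does not state the objective itself. The clipped variant of the surrogate objective from the PPO paper can be written as below, where r_t(θ) is the probability ratio between the new and old policies, Â_t is an advantage estimate at timestep t, and ε is a small clipping hyperparameter:

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\left(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t
      \right)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Taking the minimum of the clipped and unclipped terms removes the incentive to push r_t(θ) outside the interval [1 − ε, 1 + ε]. This is what lets the same batch be reused for several updates while keeping the new policy close to the old one, without TRPO's constrained optimization step.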
Findings
PPO outperforms other online policy gradient methods on the benchmark tasks tested.
PPO strikes a favorable balance between sample efficiency and simplicity of implementation.
Empirical results show PPO's effectiveness on simulated robotic locomotion and Atari game playing.
Abstract
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between…
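A minimal sketch of the two ideas in the abstract, assuming the clipped form of the surrogate objective described in the PPO paper. The function names `clipped_surrogate` and `minibatch_indices` are hypothetical helpers written for illustration, not code from the authors, and `eps=0.2` is only an illustrative default. The sketch uses plain Python lists in place of a real policy network, so it shows the objective value and the batching pattern but performs no gradient step.

```python
import math
import random


def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy; advantages: advantage estimates.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_t = pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Pessimistic bound: take the smaller of the two candidate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def minibatch_indices(n, batch_size, epochs, rng):
    """Yield index lists for several shuffled passes over the same n samples."""
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)
        for start in range(0, n, batch_size):
            yield order[start:start + batch_size]


if __name__ == "__main__":
    # Toy batch: the old policy assigned probability 0.5 to every action.
    old_logp = [math.log(0.5)] * 4
    new_logp = [math.log(0.75), math.log(0.25), math.log(0.5), math.log(0.75)]
    advantages = [1.0, 1.0, -1.0, -1.0]

    rng = random.Random(0)
    for batch in minibatch_indices(len(advantages), batch_size=2, epochs=3, rng=rng):
        value = clipped_surrogate(
            [new_logp[i] for i in batch],
            [old_logp[i] for i in batch],
            [advantages[i] for i in batch],
        )
        print(batch, round(value, 4))
```

In a real implementation the objective would be computed on tensors from an automatic-differentiation library and ascended with a stochastic gradient optimizer. The loop structure stays the same: collect a batch with the current policy, then make several shuffled minibatch passes over it before sampling again.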
Code & Models
- 🤗 Adilbai/stock-trading-rl-agent · model · 251 dl · ♡ 128
- 🤗 jingyaogong/minimind-3 · model · 81 dl · ♡ 1
- 🤗 rinna/japanese-gpt-neox-3.6b-instruction-ppo · model · 836 dl · ♡ 74
- 🤗 rinna/bilingual-gpt-neox-4b-instruction-ppo · model · 15 dl · ♡ 14
- 🤗 RichardErkhov/rinna_-_bilingual-gpt-neox-4b-instruction-ppo-4bits · model · 2 dl
- 🤗 RichardErkhov/rinna_-_bilingual-gpt-neox-4b-instruction-ppo-8bits · model
- 🤗 RichardErkhov/rinna_-_japanese-gpt-neox-3.6b-instruction-ppo-4bits · model · 5 dl
- 🤗 RichardErkhov/rinna_-_japanese-gpt-neox-3.6b-instruction-ppo-8bits · model · 1 dl
- 🤗 tsessk/llm-course-hw2-ppo · model · 1 dl
- 🤗 thsluck/llm-course-hw2-ppo · model · 1 dl
Videos
An introduction to Policy Gradient methods - Deep Reinforcement Learning · YouTube
Taxonomy
Methods: Entropy Regularization · Proximal Policy Optimization
