Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO
Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, Aleksander Madry

TL;DR
This paper demonstrates that code-level optimizations significantly influence the performance and behavior of deep policy gradient algorithms, specifically PPO and TRPO, highlighting challenges in attributing RL progress.
Contribution
The paper shows that implementation details, rather than the core algorithm, are often responsible for performance differences between deep RL algorithms, which makes careful attribution of performance gains essential.
Findings
Code optimizations account for most of PPO's reward gains over TRPO.
Implementation details fundamentally alter how RL algorithms function.
Performance improvements are often due to auxiliary implementation choices.
Abstract
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms: Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO). Specifically, we investigate the consequences of "code-level optimizations:" algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm. Seemingly of secondary importance, such optimizations turn out to have a major impact on agent behavior. Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function. These insights show the difficulty and importance of attributing performance gains in deep reinforcement learning. Code for reproducing our results is available at https://github.com/MadryLab/implementation-matters .
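To make the distinction between a core algorithm and a code-level optimization concrete, here is a minimal NumPy sketch. The first function is the standard PPO clipped surrogate objective. The second is value-function clipping, written in the form commonly seen in open-source PPO implementations; it is used here only as an illustrative example of an implementation-level detail and is an assumption, not necessarily the authors' exact formulation. The function names and the default `eps=0.2` are hypothetical choices for this sketch.

```python
import numpy as np


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Core algorithm: PPO clipped surrogate (a quantity to maximize).

    ratio     -- pi_new(a|s) / pi_old(a|s) for each sample
    advantage -- advantage estimate for each sample
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic bound: take the smaller of the unclipped and clipped terms.
    return float(np.mean(np.minimum(ratio * advantage, clipped * advantage)))


def clipped_value_loss(v_new, v_old, returns, eps=0.2):
    """Illustrative code-level optimization: value-function clipping.

    Assumed form (not taken from the text above): the new value prediction
    is kept within eps of the old one, and the larger of the two squared
    errors is used as the loss.
    """
    v_new = np.asarray(v_new, dtype=float)
    v_old = np.asarray(v_old, dtype=float)
    returns = np.asarray(returns, dtype=float)
    v_clipped = v_old + np.clip(v_new - v_old, -eps, eps)
    unclipped_err = (v_new - returns) ** 2
    clipped_err = (v_clipped - returns) ** 2
    return float(np.mean(np.maximum(unclipped_err, clipped_err)))
```

Nothing in the surrogate objective requires the value loss to be clipped, so two implementations can share `ppo_clip_objective` and still differ in details like `clipped_value_loss`. That kind of difference is what the abstract calls a code-level optimization.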
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Advanced Memory and Neural Computing
Methods: Entropy Regularization · Proximal Policy Optimization · Trust Region Policy Optimization
