Deep Reinforcement Learning at the Edge of the Statistical Precipice
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, Marc G. Bellemare

TL;DR
This paper emphasizes the importance of accounting for statistical uncertainty in deep reinforcement learning evaluations, proposing new metrics and tools to improve the reliability of performance comparisons across benchmarks.
Contribution
It introduces a rigorous statistical evaluation methodology and open-source library to enhance the reliability of deep RL benchmark results, especially with limited runs.
Findings
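Conclusions drawn from point estimates alone can differ substantially from those of an analysis that accounts for statistical uncertainty.
Interval estimates and robust aggregate metrics make reported results less sensitive to the particular runs that happened to be sampled.
Applying these methods to previously published results reveals inconsistencies in earlier evaluations.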
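To make the idea of an interval estimate concrete, here is a minimal Python sketch of a percentile bootstrap confidence interval for an aggregate score, computed with only a few runs per task. It uses the interquartile mean (IQM) as the robust aggregate and resamples runs within each task (a stratified bootstrap), both of which the paper recommends. It is an illustration only, not the paper's open-source library or its API: the function names `interquartile_mean` and `stratified_bootstrap_ci`, the task names, and the scores are all hypothetical.

```python
import random
from statistics import mean


def interquartile_mean(values):
    """Mean of the middle 50% of values (a 25% trimmed mean)."""
    ordered = sorted(values)
    cut = len(ordered) // 4  # number of values dropped from each tail
    return mean(ordered[cut:len(ordered) - cut])


def stratified_bootstrap_ci(scores, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for `statistic` over pooled scores.

    scores: dict mapping task name -> list of normalized scores, one per run.
    Runs are resampled with replacement separately within each task, so every
    resample keeps the same number of runs per task as the original data.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resampled = []
        for runs in scores.values():
            resampled.extend(rng.choices(runs, k=len(runs)))
        estimates.append(statistic(resampled))
    estimates.sort()
    # Percentile interval: read off the alpha/2 and 1 - alpha/2 quantiles.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical normalized scores: 3 tasks, 3 runs each (made-up numbers).
scores = {
    "task_a": [0.10, 0.35, 0.20],
    "task_b": [1.20, 0.90, 1.50],
    "task_c": [0.05, 0.00, 0.60],
}

point = interquartile_mean([s for runs in scores.values() for s in runs])
low, high = stratified_bootstrap_ci(scores, interquartile_mean)
print(f"IQM point estimate: {point:.3f}")
print(f"95% bootstrap CI:   [{low:.3f}, {high:.3f}]")
```

With so few runs the interval is wide compared with the point estimate, which is the uncertainty that a comparison of point estimates alone would hide.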
Abstract
Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial…
Taxonomy
Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications · Advanced Multi-Objective Optimization Algorithms
Methods: Average Pooling · Convolution · Dilated Convolution · 1x1 Convolution · Global Average Pooling · Switchable Atrous Convolution · Entropy Regularization · Proximal Policy Optimization · Prioritized Experience Replay · Double Q-learning
