Value-Based Deep RL Scales Predictably
Oleh Rybkin, Michal Nauman, Preston Fu, Charlie Snell, Pieter Abbeel, Sergey Levine, Aviral Kumar

TL;DR
This paper demonstrates that value-based off-policy reinforcement learning methods exhibit predictable scaling behavior with data and compute, enabling accurate performance extrapolation and optimal resource allocation.
Contribution
It introduces a method to predict data and compute requirements for desired performance levels and optimize resource allocation in value-based RL.
Findings
The data and compute required to reach a given performance level lie on a Pareto frontier controlled by the updates-to-data (UTD) ratio.
Scaling relationships enable performance prediction across different data and compute levels.
The approach is validated on multiple algorithms and environments by extrapolating to higher resource levels.
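The trade-off described in the findings above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's fitting procedure: it assumes the data requirement follows a pure power law in the UTD ratio, D(utd) = a * utd^(-b), and that compute can be approximated by the number of gradient updates (UTD times environment steps). The function names, the functional form, and the measurements are all hypothetical placeholders.

```python
import numpy as np


def fit_power_law(utd, data_needed):
    """Fit D = a * utd**(-b) by least squares in log-log space (assumed form)."""
    slope, intercept = np.polyfit(np.log(utd), np.log(data_needed), 1)
    return float(np.exp(intercept)), float(-slope)


def predict_data(utd, a, b):
    """Environment steps predicted to reach the target performance at a given UTD."""
    return a * np.asarray(utd, dtype=float) ** (-b)


def predict_updates(utd, a, b):
    """Gradient updates (a proxy for compute): UTD times environment steps."""
    return np.asarray(utd, dtype=float) * predict_data(utd, a, b)


# Synthetic placeholder measurements from "small-scale" runs (not real results).
small_utd = np.array([1.0, 2.0, 4.0, 8.0])
small_data = 1e6 * small_utd ** -0.5

a, b = fit_power_law(small_utd, small_data)

# Extrapolate to larger UTD ratios that were not part of the fit.
big_utd = np.array([16.0, 32.0])
pred_data = predict_data(big_utd, a, b)
pred_updates = predict_updates(big_utd, a, b)
```

With an exponent b between 0 and 1, raising the UTD ratio lowers the predicted data requirement while raising the number of updates, which is the kind of data-versus-compute trade-off a Pareto frontier describes.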
Abstract
Scaling data and compute is critical to the success of modern ML. However, scaling demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict this data requirement when given more compute, and this compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance and use it to determine hyperparameters that…
Taxonomy
Topics: Neural Networks and Applications · Technology and Data Analysis · Engineering Applied Research
Methods: Convolution · 1x1 Convolution · Global Average Pooling · Average Pooling · Dilated Convolution · Switchable Atrous Convolution
