Scaling Laws for Autoregressive Generative Modeling

Tom Henighan; Jared Kaplan; Mor Katz; Mark Chen; Christopher Hesse; Jacob Jackson; Heewoo Jun; Tom B. Brown; Prafulla Dhariwal; Scott Gray; Chris Hallacy; Benjamin Mann; Alec Radford; Aditya Ramesh; Nick Ryder; Daniel M. Ziegler; John Schulman; Dario Amodei; Sam McCandlish

arXiv:2010.14701 · cs.LG · November 9, 2020 · 150 cites

Open Access

TL;DR

This paper identifies empirical scaling laws for the cross-entropy loss of autoregressive Transformers in four domains (image, video, multimodal image-text, and mathematical problem solving), showing that loss improves predictably as model size and compute increase.

Contribution

It fits a power-law plus constant scaling law linking model size, compute, and loss in each domain, finds nearly universal exponents for the compute-optimal model size, and gives the fitted terms an information-theoretic interpretation.

Findings

01

Performance improves smoothly with model size and compute, following a power-law plus constant scaling law.

02

The optimal model size grows as a power law in the compute budget, with exponents that are nearly universal across data domains.

03

Transformers can nearly perfectly model certain image distributions at large scales.
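
To make the "power-law plus constant" form in finding 01 concrete, here is a minimal Python sketch. The parameterization `L(x) = l_inf + (x0 / x) ** alpha`, the function names, and every numeric constant below are illustrative assumptions, not values or code from the paper. It generates synthetic losses from that form and recovers the exponent from the log-log slope of the reducible part of the loss.

```python
import math


def loss(x, l_inf, x0, alpha):
    """Power-law plus constant: L(x) = l_inf + (x0 / x) ** alpha."""
    return l_inf + (x0 / x) ** alpha


def fit_exponent(xs, losses, l_inf):
    """Estimate alpha as minus the least-squares slope of
    log(L - l_inf) against log(x), assuming l_inf is known."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(l - l_inf) for l in losses]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return -num / den


# Hypothetical constants chosen only for illustration.
L_INF, X0, ALPHA = 2.0, 1e6, 0.25

sizes = [1e5, 1e6, 1e7, 1e8, 1e9]  # stand-in for model sizes
losses = [loss(n, L_INF, X0, ALPHA) for n in sizes]
alpha_hat = fit_exponent(sizes, losses, L_INF)

for n, l in zip(sizes, losses):
    print(f"x = {n:.0e}  loss = {l:.4f}")
print(f"recovered exponent: {alpha_hat:.4f}")
```

On real measurements the constant `l_inf` is not known in advance and would have to be fitted jointly with the other parameters; the sketch fixes it only to keep the example dependency-free.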

Abstract

We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image $\leftrightarrow$ text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power law, with exponents that are nearly universal across all data domains. The cross-entropy loss has an information-theoretic interpretation as $S(\mathrm{True}) + D_{\mathrm{KL}}(\mathrm{True} \,\|\, \mathrm{Model})$, and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image…
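
The decomposition quoted in the abstract follows from a standard identity. As a short worked derivation, write $P$ for the true distribution and $Q$ for the model distribution (these symbols, and the discrete sum, are notation assumed here for illustration):

$$
\mathbb{E}_{x \sim P}\left[-\log Q(x)\right]
= -\sum_x P(x)\log P(x) + \sum_x P(x)\log\frac{P(x)}{Q(x)}
= S(P) + D_{\mathrm{KL}}(P \,\|\, Q).
$$

One natural reading of the abstract, stated here as an assumption about how the fit is interpreted, is that in a power-law plus constant fit of the form $L(x) = L_\infty + (x_0/x)^{\alpha}$ the constant $L_\infty$ estimates the entropy $S(P)$, while the power-law term estimates the KL divergence, which shrinks as the scale variable $x$ (model size or compute) grows. The names $L_\infty$, $x_0$, and $\alpha$ are likewise illustrative.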

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning