Scaling Laws for Autoregressive Generative Modeling
Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, Sam McCandlish

TL;DR
This paper uncovers empirical scaling laws governing the performance of autoregressive models across various domains, revealing predictable improvements with increased model size and compute, and providing insights into model capabilities and limitations.
Contribution
It identifies empirical scaling laws for autoregressive models across four data domains, relating model size, compute, and cross-entropy loss, and offers an information-theoretic interpretation along with practical forecasts.
Findings
Performance improves smoothly with model size and compute, following a power law plus a constant.
Nearly universal exponents describe optimal model size relative to compute budgets.
Transformers can nearly perfectly model certain image distributions at large scales.
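The second finding, that optimal model size grows as a power law in compute, can be sketched as a simple extrapolation from one known reference point. This is a minimal illustration, not the paper's procedure: the function name, the reference values, and the exponent used in the usage lines are hypothetical placeholders, not figures taken from this summary.

```python
def optimal_model_size(compute, n_ref, c_ref, beta):
    """Extrapolate the compute-optimal parameter count.

    Assumes N_opt(C) = n_ref * (C / c_ref) ** beta, i.e. a pure power law
    anchored at a reference point (c_ref, n_ref). All arguments are
    illustrative; beta must be estimated from real training runs.
    """
    if compute <= 0 or c_ref <= 0:
        raise ValueError("compute budgets must be positive")
    return n_ref * (compute / c_ref) ** beta


# Hypothetical numbers: a 1e6-parameter model is optimal at 1 unit of compute,
# and the exponent is assumed to be 0.5 purely for illustration.
n_at_100x = optimal_model_size(100.0, n_ref=1e6, c_ref=1.0, beta=0.5)
print(f"optimal size at 100x compute: {n_at_100x:.3g} parameters")
```

Because the relation is a power law, multiplying the compute budget by a fixed factor always multiplies the optimal size by the same factor, regardless of the starting budget.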
Abstract
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image↔text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains. The cross-entropy loss has an information theoretic interpretation as $S(\mathrm{True}) + D_{\mathrm{KL}}(\mathrm{True} \,\|\, \mathrm{Model})$, and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image…
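The "power-law plus constant" form and its information-theoretic reading can be made concrete with a small sketch. It assumes the loss takes the form L(x) = L_inf + (x0 / x)^alpha, where x is model size or compute, and reads the fitted constant L_inf as an estimate of the data entropy and the remaining power-law term as an estimate of the KL divergence. The function names, the grid-search fitting method, and the synthetic data are all assumptions made for illustration; they are not the authors' fitting code or results.

```python
import math


def scaling_law(x, l_inf, x0, alpha):
    """Power law plus constant: L(x) = l_inf + (x0 / x) ** alpha."""
    return l_inf + (x0 / x) ** alpha


def fit_reducible(xs, losses, l_inf):
    """Fit a line to log(L - l_inf) versus log(x) by least squares.

    Returns (x0, alpha, sum of squared residuals in log space).
    """
    u = [math.log(x) for x in xs]
    v = [math.log(loss - l_inf) for loss in losses]
    n = len(u)
    u_mean = sum(u) / n
    v_mean = sum(v) / n
    sxx = sum((a - u_mean) ** 2 for a in u)
    sxy = sum((a - u_mean) * (b - v_mean) for a, b in zip(u, v))
    slope = sxy / sxx
    intercept = v_mean - slope * u_mean
    alpha = -slope                       # log(L - l_inf) = alpha*log(x0) - alpha*log(x)
    x0 = math.exp(intercept / alpha)
    sse = sum((intercept + slope * a - b) ** 2 for a, b in zip(u, v))
    return x0, alpha, sse


def fit_scaling_law(xs, losses, steps=2000):
    """Grid-search the constant l_inf in [0, min loss), fitting the power law at each."""
    l_min = min(losses)
    best = None
    for i in range(steps):
        l_inf = l_min * i / steps
        x0, alpha, sse = fit_reducible(xs, losses, l_inf)
        if best is None or sse < best[3]:
            best = (l_inf, x0, alpha, sse)
    return best[:3]


# Synthetic, noise-free data generated from made-up parameters (not from the paper).
xs = [1e5, 1e6, 1e7, 1e8, 1e9]
losses = [scaling_law(x, l_inf=2.0, x0=1e4, alpha=0.25) for x in xs]

l_inf_hat, x0_hat, alpha_hat = fit_scaling_law(xs, losses)
entropy_estimate = l_inf_hat                      # irreducible loss, read as S(True)
kl_estimate = (x0_hat / xs[-1]) ** alpha_hat      # reducible loss at the largest x, read as D_KL
print(f"entropy ~ {entropy_estimate:.3f}, KL at largest model ~ {kl_estimate:.4f}")
```

Under this reading, a model is "nearly perfect" when the reducible term is small relative to the constant, since the constant is a floor that no model can go below.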
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
