Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei

TL;DR
This paper investigates empirical scaling laws in neural language models, revealing power-law relationships between performance, model size, dataset size, and compute, guiding optimal training strategies for large models.
Contribution
It establishes simple, quantitative scaling laws for language model performance and compute efficiency, providing practical guidelines for training large models effectively.
Findings
Cross-entropy loss scales as a power law with model size, dataset size, and training compute.
Architectural details like width and depth have minimal impact within studied ranges.
Compute-optimal training uses very large models on a relatively modest amount of data and stops significantly before convergence.
Abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
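To make the power-law claim concrete, here is a minimal sketch of how such a trend could be fitted. It assumes a loss of the form L(N) = (N_c / N)^alpha in model size N, which becomes a straight line in log-log space. The constants `TRUE_N_C` and `TRUE_ALPHA`, the function names, and the synthetic data are illustrative assumptions, not values or code from the paper.

```python
import numpy as np


def power_law_loss(n, n_c, alpha):
    """Loss of the assumed form L(N) = (N_c / N) ** alpha."""
    return (n_c / n) ** alpha


def fit_power_law(n, loss):
    """Recover (n_c, alpha) by least squares in log-log space.

    Taking logs gives log L = alpha * log N_c - alpha * log N,
    so the slope is -alpha and the intercept is alpha * log N_c.
    """
    slope, intercept = np.polyfit(np.log(n), np.log(loss), 1)
    alpha = -slope
    n_c = np.exp(intercept / alpha)
    return n_c, alpha


# Hypothetical constants chosen only for illustration.
TRUE_N_C = 1.0e13
TRUE_ALPHA = 0.08

# Synthetic, noise-free "measurements" over four orders of magnitude in N.
sizes = np.logspace(5, 9, 9)
losses = power_law_loss(sizes, TRUE_N_C, TRUE_ALPHA)

n_c_hat, alpha_hat = fit_power_law(sizes, losses)
print(f"fitted alpha = {alpha_hat:.4f}, fitted N_c = {n_c_hat:.3e}")
```

Under this assumed form, doubling N multiplies the loss by 2^(-alpha) regardless of the starting size, which is what makes the trend a straight line on a log-log plot. With real measurements the fit would be noisy, and the usable range would have to be chosen by inspection.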
Code & Models
- 🤗 StentorLabs/Stentor2-12M (model) · 124 dl · ♡ 2
- 🤗 lightonai/pagnol-small (model) · 175 dl · ♡ 1
- 🤗 lightonai/pagnol-medium (model) · 10 dl · ♡ 1
- 🤗 lightonai/pagnol-large (model) · 10 dl · ♡ 1
- 🤗 lightonai/pagnol-xl (model) · 12 dl · ♡ 1
- 🤗 wordgrammer/Plato_v1 (model) · ♡ 1
- 🤗 jd0g/chess-gpt (model)
- 🤗 RichardErkhov/lightonai_-_pagnol-medium-4bits (model) · 1 dl
- 🤗 RichardErkhov/lightonai_-_pagnol-medium-8bits (model) · 4 dl
- 🤗 HarleyCooper/nanochat561 (model) · 24 dl · ♡ 6
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Neural Networks and Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
