Scaling Laws for Neural Language Models

Jared Kaplan; Sam McCandlish; Tom Henighan; Tom B. Brown; Benjamin Chess; Rewon Child; Scott Gray; Alec Radford; Jeffrey Wu; Dario Amodei

arXiv:2001.08361·cs.LG·January 24, 2020·1.5k cites

TL;DR

This paper investigates empirical scaling laws in neural language models, revealing power-law relationships between performance, model size, dataset size, and compute, guiding optimal training strategies for large models.

Contribution

It establishes simple, quantitative scaling laws for language model performance and compute efficiency, providing practical guidelines for training large models effectively.

Findings

01 Cross-entropy loss follows a power law in model size, dataset size, and training compute.

02 Architectural details such as network width and depth have minimal effect within the ranges studied.

03 Compute-optimal training uses very large models on a relatively modest amount of data and stops well before convergence.
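
The power-law behavior in the first finding can be written compactly. The following is a sketch, not a quotation from this page: X stands for any one of model size, dataset size, or compute when the other two are not the limiting factor, and X_c and \alpha_X are fitted constants whose symbols are introduced here for illustration.

```latex
L(X) = \left( \frac{X_c}{X} \right)^{\alpha_X},
\qquad
\log L = \alpha_X \log X_c - \alpha_X \log X
```

On log-log axes this form is a straight line with slope -\alpha_X, which is why such trends are usually read off log-log plots.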

Abstract

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
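
To make the power-law claim concrete, here is a minimal sketch of how such an exponent could be estimated from (size, loss) pairs by linear regression in log-log space. The function name `fit_power_law`, the synthetic data, and the constants `true_alpha` and `true_x_c` are illustrative assumptions; they are not the paper's code or its fitted values.

```python
import numpy as np


def fit_power_law(x, loss):
    """Fit loss ~ (x_c / x) ** alpha by least squares in log-log space.

    Taking logs gives log(loss) = alpha * log(x_c) - alpha * log(x),
    so the slope of the fitted line is -alpha.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    alpha = -slope
    x_c = np.exp(intercept / alpha)
    return alpha, x_c


# Synthetic example: placeholder constants, NOT values reported in the paper.
true_alpha, true_x_c = 0.08, 1.0e14
sizes = np.logspace(5, 9, num=9)  # hypothetical model sizes
rng = np.random.default_rng(0)
noise = np.exp(rng.normal(0.0, 0.01, sizes.size))  # small multiplicative noise
losses = (true_x_c / sizes) ** true_alpha * noise

alpha_hat, x_c_hat = fit_power_law(sizes, losses)
print(f"estimated alpha = {alpha_hat:.4f}, estimated x_c = {x_c_hat:.3e}")
```

The recovered exponent should land close to `true_alpha`. The scale constant `x_c` is far more sensitive to noise, because it is extrapolated well outside the sampled range.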

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Neural Networks and Applications
