Language models scale reliably with over-training and on downstream tasks
Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis

TL;DR
This paper develops scaling laws for language models that account for over-training and downstream task performance, enabling accurate predictions of large model behavior from smaller experiments.
Contribution
The paper introduces a testbed of small-scale training runs and fits scaling laws to it that predict both the loss and the downstream task accuracy of larger models, addressing gaps between existing scaling studies and how models are actually trained and evaluated.
Findings
Scaling laws accurately predict loss for over-trained models.
Power law relates perplexity to downstream task error.
Predictions require significantly less compute than direct experiments.
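To make the extrapolation idea in the findings concrete, here is a minimal sketch of fitting a power law to small-scale runs and using it to predict a larger one. This summary does not give the paper's functional form, so the sketch assumes a plain power law y = a * x^b fitted by least squares in log-log space. The function names and every data point are hypothetical and synthetic; none of the numbers come from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space. Returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope and intercept of the regression line log y = log a + b * log x.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** b


# Synthetic (compute, loss) pairs generated from a known power law.
# These are illustrative assumptions, NOT measurements from the paper.
TRUE_A, TRUE_B = 50.0, -0.08
small_compute = [1e17, 3e17, 1e18, 3e18, 1e19]
small_loss = [TRUE_A * c ** TRUE_B for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)

# Extrapolate two orders of magnitude beyond the largest fitted run.
predicted_loss = predict(a, b, 1e21)
```

The same routine could in principle be applied to (perplexity, downstream error) pairs to sketch the second finding, though a real fit would likely need extra terms (for example an irreducible-loss offset) that this simplified form leaves out.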
Abstract
Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., "Chinchilla optimal" regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to…
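As a rough illustration of what "over-training" means relative to the compute-optimal regime, the sketch below computes a tokens-per-parameter ratio and an approximate training cost. Two assumptions are not stated in this abstract: the widely used rule of thumb that compute-optimal training uses roughly 20 tokens per parameter, and the common approximation that training costs about 6 * N * D FLOPs for N parameters and D tokens. The helper names and the 8x over-training factor are hypothetical.

```python
def token_multiplier(num_params, num_tokens):
    """Training tokens per model parameter."""
    return num_tokens / num_params


def approx_train_flops(num_params, num_tokens):
    """Approximate training cost using the common 6 * N * D estimate."""
    return 6.0 * num_params * num_tokens


# Assumption: ~20 tokens per parameter is a rule-of-thumb value for
# compute-optimal training; it is not taken from this abstract.
CHINCHILLA_TOKENS_PER_PARAM = 20.0

num_params = 6.9e9  # largest model size mentioned in the abstract
optimal_tokens = CHINCHILLA_TOKENS_PER_PARAM * num_params

# Hypothetical over-trained run: 8x more tokens than the rule of thumb.
overtrained_tokens = 8 * optimal_tokens

optimal_flops = approx_train_flops(num_params, optimal_tokens)
overtrained_flops = approx_train_flops(num_params, overtrained_tokens)

print(f"tokens/param (optimal):      {token_multiplier(num_params, optimal_tokens):.0f}")
print(f"tokens/param (over-trained): {token_multiplier(num_params, overtrained_tokens):.0f}")
print(f"training FLOPs ratio:        {overtrained_flops / optimal_flops:.1f}x")
```

Under these assumptions, over-training spends more training compute on a model of fixed size, which is the trade the abstract describes: higher training cost in exchange for a model that is cheaper to serve at inference time.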
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
