Language models scale reliably with over-training and on downstream tasks

Samir Yitzhak Gadre; Georgios Smyrnis; Vaishaal Shankar; Suchin Gururangan; Mitchell Wortsman; Rulin Shao; Jean Mercat; Alex Fang; Jeffrey Li; Sedrick Keh; Rui Xin; Marianna Nezhurina; Igor Vasiljevic; Jenia Jitsev; Luca Soldaini; Alexandros G. Dimakis; Gabriel Ilharco; Pang Wei Koh; Shuran Song; Thomas Kollar; Yair Carmon; Achal Dave; Reinhard Heckel; Niklas Muennighoff; Ludwig Schmidt

arXiv:2403.08540·cs.CL·June 18, 2024·3 cites

Open Access · 1 Repo

TL;DR

This paper develops scaling laws for language models that account for over-training and downstream task performance, enabling accurate predictions of large model behavior from smaller experiments.

Contribution

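The paper introduces a testbed of 104 models (0.011B to 6.9B parameters, trained on three data distributions) and fits scaling laws that predict the loss and downstream task performance of large models from small-scale experiments. This addresses two gaps in existing scaling studies: over-training beyond the compute-optimal regime, and evaluation on downstream tasks rather than next-token loss alone.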

Findings

01 Scaling laws accurately predict loss for over-trained models.

02 A power law relates perplexity to downstream task error.

03 Predictions require significantly less compute than direct experiments.
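To make the second finding concrete, here is a minimal sketch of fitting a power law between perplexity and downstream error. It assumes the form Err ≈ eps - k * PP^(-gamma), where eps is a fixed error ceiling chosen by the user; that form, the function names, and every number below are illustrative assumptions, not values or code from the paper.

```python
import numpy as np


def fit_error_vs_perplexity(perplexity, error, eps):
    """Fit error ≈ eps - k * perplexity**(-gamma) for a fixed, assumed eps.

    Rearranging gives log(eps - error) = log(k) - gamma * log(perplexity),
    which is a straight line in log-log space.
    """
    y = np.log(eps - np.asarray(error, dtype=float))
    x = np.log(np.asarray(perplexity, dtype=float))
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(np.exp(intercept)), float(-slope)  # (k, gamma)


def predict_error(perplexity, eps, k, gamma):
    """Evaluate the assumed power law at the given perplexity values."""
    return eps - k * np.asarray(perplexity, dtype=float) ** (-gamma)


# Synthetic, noise-free data generated from the assumed form (hypothetical values).
eps_assumed, k_true, gamma_true = 0.75, 1.2, 0.5
small_model_ppl = np.array([30.0, 25.0, 20.0, 16.0, 13.0])
small_model_err = predict_error(small_model_ppl, eps_assumed, k_true, gamma_true)

k_fit, gamma_fit = fit_error_vs_perplexity(small_model_ppl, small_model_err, eps_assumed)

# Extrapolate to a lower perplexity than any observed in the fit.
err_at_ppl_10 = float(predict_error(10.0, eps_assumed, k_fit, gamma_fit))
```

With real measurements the points would be noisy and eps would itself need to be estimated, for example with a nonlinear least-squares fit; the log-space regression here only works because eps is fixed in advance.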

Abstract

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., "Chinchilla optimal" regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to…
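As a rough illustration of the extrapolation idea described in the abstract, the sketch below fits a simple power law, loss ≈ a * C^(-b), to a handful of small runs and evaluates it at a larger compute budget. The functional form, the function names (fit_power_law, predict_loss), and all numbers are assumptions made for illustration; the paper's own scaling law also accounts for the amount of over-training and is not reproduced here.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)  # (a, b)


def predict_loss(a, b, compute):
    """Evaluate the fitted power law at one or more compute budgets."""
    return a * np.asarray(compute, dtype=float) ** (-b)


# Synthetic small-scale runs (made-up numbers, not results from the paper).
small_compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])  # training FLOPs
a_true, b_true = 60.0, 0.07
small_loss = a_true * small_compute ** (-b_true)

a_fit, b_fit = fit_power_law(small_compute, small_loss)

# Extrapolate two orders of magnitude beyond the largest small run.
predicted_large_loss = float(predict_loss(a_fit, b_fit, 1e21))
```

This simplified form omits an irreducible-loss term, so it tends toward zero loss at very large compute; a more realistic fit would add a constant offset and estimate it with a nonlinear solver.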

Code & Models

Repositories

mlfoundations/scaling (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques