A Hitchhiker's Guide to Scaling Law Estimation

Leshem Choshen; Yang Zhang; Jacob Andreas

arXiv:2410.11840 · cs.LG · June 4, 2025

TL;DR

This paper studies how to estimate scaling laws accurately for machine learning models. It analyzes a large dataset of pretrained models and derives best practices, along with insights into how much scaling behavior varies and how well it transfers.

Contribution

The paper releases a large-scale dataset of losses and downstream evaluations for 485 previously published pretrained models, estimates more than 1000 scaling laws from it, and offers practical guidelines for estimating scaling laws in new model families.

Findings

01 Fitting scaling laws to intermediate checkpoints of training runs, and not just to final losses, improves accuracy.

02 Estimates are more accurate when derived from other models of similar size.

03 Training multiple small models can be more effective than training one large model.

Abstract

Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pretraining decisions involving optimizers, datasets, and model architectures. Despite the widespread use of scaling laws to model the dynamics of language model training, there has been little work on understanding how to best estimate and interpret them. We collect (and release) a large-scale dataset containing losses and downstream evaluations for 485 previously published pretrained models. We use these to estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families. We find that fitting scaling laws to intermediate checkpoints of training runs (and not just their final losses)…
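To make the extrapolation idea in the abstract concrete, here is a minimal sketch in Python. It assumes a pure power law, loss = a * N^(-b), fitted by least squares in log-log space. That form is a simplification chosen for illustration and is not necessarily the functional form used in the paper. The data points are synthetic, and the names `fit_power_law` and `predict_loss` are hypothetical.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size).

    Returns (a, b) for the assumed form loss = a * size ** (-b).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict_loss(a, b, size):
    """Extrapolate the fitted law to a new model size."""
    return a * size ** (-b)

# Synthetic, noise-free data generated from a known law (illustration only,
# not values from the paper or its dataset).
TRUE_A, TRUE_B = 20.0, 0.1
small_sizes = [1e7, 3e7, 1e8, 3e8]  # parameter counts of the "cheap" models
small_losses = [TRUE_A * n ** (-TRUE_B) for n in small_sizes]

a, b = fit_power_law(small_sizes, small_losses)

# Predict the loss of a larger target model that was never "trained".
target_size = 1e9
predicted = predict_loss(a, b, target_size)
actual = TRUE_A * target_size ** (-TRUE_B)
relative_error = abs(predicted - actual) / actual

print(f"fitted a={a:.3f}, b={b:.3f}")
print(f"predicted loss={predicted:.4f}, relative error={relative_error:.2e}")
```

Because the synthetic data is noise-free, the fit recovers the generating parameters almost exactly. With real training runs, noise across seeds and checkpoints makes the estimate less certain, which is the kind of variability the paper examines.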

Code & Models

Repositories

IBM/ColPret (official)

Videos

A Hitchhiker's Guide to Scaling Law Estimation · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Machine Learning and Data Classification · Topic Modeling

Methods: Sparse Evolutionary Training