(Mis)Fitting: A Survey of Scaling Laws
Margaret Li, Sneha Kudugunta, Luke Zettlemoyer

TL;DR
This survey critically examines how various factors influence the derivation of scaling laws in foundation models, highlighting discrepancies in prior research and proposing guidelines for reproducibility.
Contribution
The paper provides a comprehensive review of over 50 studies on scaling laws, analyzes the impact of methodological differences, and introduces a checklist to improve reproducibility in scaling law research.
Findings
Most studies use power laws to describe scaling trends.
Methodological differences significantly affect scaling law conclusions.
Many papers lack crucial details for reproducibility.
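As a rough illustration of the first finding, a power-law fit can be sketched as a straight-line regression in log-log space. This is a minimal sketch, not the survey's method: the functional form `loss ≈ a * scale**(-b)`, the helper name `fit_power_law`, and the synthetic data points are all assumptions made for the example, and none of the numbers come from the paper.

```python
import numpy as np


def fit_power_law(scale, loss):
    """Fit loss ≈ a * scale**(-b) by ordinary least squares in log-log space.

    Hypothetical helper for illustration; returns the pair (a, b).
    """
    log_scale = np.log(np.asarray(scale, dtype=float))
    log_loss = np.log(np.asarray(loss, dtype=float))
    # log(loss) = log(a) - b * log(scale), so the slope is -b.
    slope, intercept = np.polyfit(log_scale, log_loss, deg=1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic, noise-free points generated from a known power law
# (assumed values chosen for the example, not results from the paper).
params = np.array([1e6, 1e7, 1e8, 1e9])
losses = 50.0 * params ** -0.1

a, b = fit_power_law(params, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Choices such as fitting in log space, adding an irreducible-loss term, or switching the optimizer are the kind of methodological details that, per the second finding, can change the fitted law.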
Abstract
Modern foundation models rely heavily on using scaling laws to guide crucial training decisions. Researchers often extrapolate the optimal architecture and hyperparameter settings from smaller training runs by describing the relationship between loss, or task performance, and scale. All components of this process vary, from the specific equation being fit, to the training setup, to the optimization method. Each of these factors may affect the fitted law, and therefore the conclusions of a given study. We discuss discrepancies in the conclusions that several prior works reach, on questions such as the optimal token-to-parameter ratio. We augment this discussion with our own analysis of the critical impact that changes in specific details may have on a scaling study, and the resulting altered conclusions. Additionally, we survey over 50 papers that study scaling trends: while 45 of…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
