Observational Scaling Laws and the Predictability of Language Model Performance
Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto

TL;DR
This paper takes an observational approach to language model scaling laws: instead of training models across many scales, it fits scaling laws to ~100 publicly available models, which makes complex scaling phenomena predictable and allows performance to be forecast without training new models.
Contribution
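It proposes a generalized scaling law in which language model performance is a function of a low-dimensional capability space, and model families differ only in how efficiently they convert training compute into capabilities. This allows model performance and emergent phenomena to be predicted from existing models.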
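One way to write down the structure described above is sketched below. The symbols are assumptions introduced here for illustration, not notation taken from this summary: $E_m$ is a downstream metric for model $m$, $S_m$ is its vector of $K$ capability measures, $C_m$ is its training compute, $f$ indexes the model family, and $\sigma$ is a sigmoidal link.

```latex
% Performance depends on the model only through a low-dimensional capability vector S_m.
E_m \approx \sigma\!\left(\beta^{\top} S_m + \alpha\right), \qquad S_m \in \mathbb{R}^{K}

% Families differ only in how efficiently training compute is converted into capabilities.
S_m \approx \theta_f \log C_m + \nu_f
```

Under this reading, $\beta$ and $\alpha$ are shared across all model families, while $\theta_f$ and $\nu_f$ are family-specific. Substituting the second relation into the first gives a compute scaling law for a single family.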
Findings
Performance, including on seemingly emergent phenomena, follows a smooth and predictable sigmoidal pattern.
The capabilities of larger models can be forecast from smaller models.
The impact of post-training interventions can be predicted.
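The findings above can be illustrated with a minimal sketch on synthetic data. It is not the authors' code: the function names `principal_capabilities`, `fit_sigmoid_link`, and `predict` are hypothetical, the benchmark scores are randomly generated, and fitting a line to the logit of accuracy is a simplification chosen here to keep the example short. The sketch extracts one capability measure from a models-by-benchmarks matrix with PCA, fits a sigmoidal link on the weaker models only, and forecasts the stronger ones.

```python
import numpy as np


def principal_capabilities(scores, k):
    """Project a (models x benchmarks) score matrix onto its top-k principal components."""
    centered = scores - scores.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


def fit_sigmoid_link(capabilities, accuracy, eps=1e-6):
    """Fit accuracy ~ sigmoid(capabilities @ beta + alpha) by least squares on the logit."""
    acc = np.clip(accuracy, eps, 1.0 - eps)
    logit = np.log(acc / (1.0 - acc))
    design = np.column_stack([capabilities, np.ones(len(capabilities))])
    coef, *_ = np.linalg.lstsq(design, logit, rcond=None)
    return coef[:-1], coef[-1]


def predict(capabilities, beta, alpha):
    """Evaluate the fitted sigmoidal link at the given capability measures."""
    return 1.0 / (1.0 + np.exp(-(capabilities @ beta + alpha)))


# Synthetic stand-in for public benchmark results (assumption: one hidden capability).
rng = np.random.default_rng(0)
n_models, n_benchmarks = 60, 8
latent = rng.uniform(-3.0, 3.0, size=n_models)
loadings = rng.uniform(0.5, 1.5, size=n_benchmarks)
benchmarks = np.outer(latent, loadings) + 0.05 * rng.normal(size=(n_models, n_benchmarks))

# Synthetic downstream metric that is sigmoidal in the hidden capability.
downstream = 1.0 / (1.0 + np.exp(-(1.5 * latent - 1.0)))

# PCA uses every model's benchmark scores; only the downstream metric is held out.
S = principal_capabilities(benchmarks, k=1)

# The hidden capability is known here only because the data is synthetic; with real
# data one would split on something observable, such as training compute.
weak = latent < 1.0
beta, alpha = fit_sigmoid_link(S[weak], downstream[weak])
forecast = predict(S[~weak], beta, alpha)
max_err = float(np.max(np.abs(forecast - downstream[~weak])))
print(f"max absolute forecast error on held-out stronger models: {max_err:.4f}")
```

The sign of a principal component is arbitrary, which is harmless here because the fitted `beta` absorbs it. On real data the logit transform is sensitive to accuracies near 0 or 1, so a nonlinear least-squares fit of the sigmoid may be the more robust choice.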
Abstract
Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ~100 publicly available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities. Using this approach, we show the surprising…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Adam · Dropout
