Observational Scaling Laws and the Predictability of Language Model Performance

Yangjun Ruan; Chris J. Maddison; Tatsunori Hashimoto

arXiv:2405.10938 · cs.LG · October 3, 2024 · 2 cites

TL;DR

This paper introduces an observational approach to scaling laws for language model performance: rather than training models across many scales, it fits scaling laws to ~100 publicly available models, finds that the resulting performance trends are predictable, and uses them to forecast performance without additional training.

Contribution

It proposes a generalized scaling law based on a low-dimensional capability space, allowing prediction of model performance and emergent phenomena from existing models.

Findings

01. Performance follows a sigmoidal, predictable pattern.
02. Model capabilities can be forecasted from small models.
03. Post-training interventions' impacts can be predicted.

Abstract

Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ~100 publicly available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities. Using this approach, we show the surprising…
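
The structure described in the abstract can be illustrated with a toy sketch. It is not the paper's procedure (the abstract is truncated here, and the official repository is the reference for that); it only assumes a one-dimensional capability that is log-linear in training compute, with family-specific efficiency, and a sigmoidal link from capability to downstream accuracy. All family names, parameter values, and function names below are hypothetical and the data are synthetic.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


# Hypothetical per-family efficiency parameters (made-up numbers).
# Assumed form: capability = theta * log10(training FLOPs) + nu
FAMILIES = {
    "family_a": {"theta": 1.0, "nu": -21.0},
    # family_b reaches the same capability with 10x less compute
    "family_b": {"theta": 1.0, "nu": -20.0},
}


def capability(family, compute_flops):
    """Map training compute to a 1-D capability score for a given family."""
    p = FAMILIES[family]
    return p["theta"] * math.log10(compute_flops) + p["nu"]


def downstream_accuracy(s, beta=1.5, alpha=-1.0):
    """Assumed sigmoidal link from capability to benchmark accuracy."""
    return sigmoid(beta * s + alpha)


def fit_sigmoid_link(capabilities, accuracies):
    """Least-squares fit of logit(accuracy) = beta * capability + alpha."""
    ys = [logit(a) for a in accuracies]
    n = len(ys)
    mean_x = sum(capabilities) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in capabilities)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(capabilities, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return beta, alpha


# Synthetic "observed" models from two families (noise-free for clarity).
models = [
    ("family_a", 1e20), ("family_a", 1e21), ("family_a", 1e22),
    ("family_b", 1e19), ("family_b", 1e20), ("family_b", 1e21),
]
caps = [capability(f, c) for f, c in models]
accs = [downstream_accuracy(s) for s in caps]

# A single link is fit across both families, since they share capability space.
beta_hat, alpha_hat = fit_sigmoid_link(caps, accs)

# Forecast accuracy for a larger, unobserved model.
s_big = capability("family_a", 1e24)
forecast = sigmoid(beta_hat * s_big + alpha_hat)
print(f"beta={beta_hat:.3f} alpha={alpha_hat:.3f} forecast={forecast:.3f}")
```

In this toy setting the two families differ only in how much compute they need to reach a given capability, so one sigmoid fitted in capability space serves both. With real models the capability scores would have to be estimated from observed benchmark results rather than assumed, and the fit would be noisy.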

Code & Models

Repositories

ryoungj/obsscaling · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Adam · Dropout