Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments
Maor Ivgi, Yair Carmon, Jonathan Berant

TL;DR
This paper empirically investigates neural scaling laws in language models. It shows that they can predict the performance of larger models and aid debugging, but that revealing them requires careful hyperparameter tuning and multiple runs, which adds computational cost.
Contribution
The paper demonstrates that scaling laws emerge at finetuning time and that they are useful for performance prediction and debugging across NLP tasks. It also highlights the practical cost of the hyperparameter tuning and repeated runs needed to reveal them.
Findings
Scaling laws emerge at finetuning time in some NLP tasks.
Where scaling laws exist, they can predict the performance of larger models.
Careful hyperparameter tuning is needed to reveal scaling laws.
Abstract
Neural scaling laws define a predictable relationship between a model's parameter count and its performance after training in the form of a power law. However, most research to date has not explicitly investigated whether scaling laws can be used to accelerate model development. In this work, we perform such an empirical investigation across a wide range of language understanding tasks, starting from models with as few as 10K parameters, and evaluate downstream performance across 9 language understanding tasks. We find that scaling laws emerge at finetuning time in some NLP tasks, and that they can also be exploited for debugging convergence when training large models. Moreover, for tasks where scaling laws exist, they can be used to predict the performance of larger models, which enables effective model selection. However, revealing scaling laws requires careful hyperparameter tuning…
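The power-law relationship described in the abstract can be illustrated with a minimal sketch. It assumes the simplest form, loss ≈ c · N^(-alpha), where N is the parameter count, and fits it by linear regression in log-log space. The function names, the synthetic data, and the constants `true_c` and `true_alpha` are hypothetical and chosen only for illustration; they are not taken from the paper, and the authors' actual fitting procedure may differ.

```python
import numpy as np


def fit_power_law(param_counts, losses):
    """Fit loss ~ c * N**(-alpha) by least squares in log-log space.

    Returns (c, alpha). Illustrative helper, not the paper's method.
    """
    log_n = np.log(np.asarray(param_counts, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    # log(loss) = log(c) - alpha * log(N), i.e. a straight line in log-log space
    slope, intercept = np.polyfit(log_n, log_l, deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(c, alpha, n_params):
    """Extrapolate the fitted power law to a model with n_params parameters."""
    return c * float(n_params) ** (-alpha)


# Synthetic, noise-free "small-scale" results (hypothetical values).
true_c, true_alpha = 20.0, 0.1
small_sizes = [1e4, 1e5, 1e6, 1e7]
small_losses = [true_c * n ** (-true_alpha) for n in small_sizes]

c, alpha = fit_power_law(small_sizes, small_losses)
predicted = predict_loss(c, alpha, 1e9)  # extrapolate to a much larger model
print(f"c={c:.3f}, alpha={alpha:.3f}, predicted loss at 1e9 params={predicted:.3f}")
```

With real measurements the points are noisy, so a fit like this would typically be repeated over several runs per model size to estimate the uncertainty of the extrapolation, in line with the summary's remark that multiple runs are required.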
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Materials Science
