Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
Shikai Qiu, Lechao Xiao, Andrew Gordon Wilson, Jeffrey Pennington, Atish Agarwala

TL;DR
This paper shows that loss curves from compute-optimally trained neural networks of different sizes align onto a single curve when training compute and loss are normalized to unity at the end of training, and that the tightness of this alignment serves as an indicator of good hyperparameter scaling.
Contribution
It demonstrates that loss curve collapse is universal across compute-optimally trained neural networks and introduces supercollapse as a practical indicator of good scaling.
Findings
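Loss curves from models of varying sizes collapse onto a single universal curve once training compute and loss are normalized to unity at the end of training.
With learning rate decay, the collapse tightens into supercollapse: differences between normalized curves across model sizes fall below the seed-to-seed noise floor of individual loss curves.
Collapse breaks down when hyperparameters are scaled suboptimally, so a failure to collapse signals poor scaling.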
Abstract
What scaling limits govern neural network training dynamics when model size and training time grow in tandem? We show that despite the complex interactions between architecture, training algorithms, and data, compute-optimally trained models exhibit a remarkably precise universality. Specifically, loss curves from models of varying sizes collapse onto a single universal curve when training compute and loss are normalized to unity at the end of training. With learning rate decay, the collapse becomes so tight that differences in the normalized curves across models fall below the noise floor of individual loss curves across random seeds, a phenomenon we term supercollapse. We observe supercollapse across learning rate schedules, datasets, and architectures, including transformers trained on next-token prediction, and find it breaks down when hyperparameters are scaled suboptimally,…
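The normalization described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `normalize_curve` and `collapse_deviation`, the optional `loss_offset` parameter, the comparison grid, and the synthetic power-law curves are all assumptions made for illustration. Each curve is rescaled so that compute and loss both equal 1 at the end of training, and the spread between rescaled curves is then measured on a shared grid.

```python
import numpy as np

def normalize_curve(compute, loss, loss_offset=0.0):
    """Rescale a loss curve so compute and loss equal 1 at the end of training.

    loss_offset is a hypothetical knob (default 0) for subtracting a constant
    from the loss before rescaling; it is an assumption of this sketch.
    """
    compute = np.asarray(compute, dtype=float)
    loss = np.asarray(loss, dtype=float) - loss_offset
    return compute / compute[-1], loss / loss[-1]

def collapse_deviation(curves, grid=None):
    """Largest gap between normalized curves, evaluated on a shared grid."""
    if grid is None:
        # Assumed grid: the last 90% of normalized compute.
        grid = np.linspace(0.1, 1.0, 50)
    resampled = []
    for compute, loss in curves:
        x, y = normalize_curve(compute, loss)
        resampled.append(np.interp(grid, x, y))
    resampled = np.stack(resampled)
    return float(np.max(resampled.max(axis=0) - resampled.min(axis=0)))

# Synthetic stand-ins for two "model sizes": same power law, different scale
# and different total compute. These are illustrative, not measured data.
c_small = np.geomspace(1.0, 1e2, 400)
c_large = np.geomspace(1.0, 1e3, 400)
collapsing = [(c_small, 3.0 * c_small ** -0.5), (c_large, 2.0 * c_large ** -0.5)]

# Adding a constant to one curve changes its normalized shape, so the pair
# no longer collapses.
non_collapsing = [(c_small, c_small ** -0.5 + 0.5), (c_large, c_large ** -0.5)]

dev_ok = collapse_deviation(collapsing)
dev_bad = collapse_deviation(non_collapsing)
print(f"collapsing pair deviation:     {dev_ok:.2e}")
print(f"non-collapsing pair deviation: {dev_bad:.2e}")
```

To test for supercollapse in the sense of the abstract, one would compare a deviation like this against the spread of loss curves across random seeds for a single model, which this sketch does not compute.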
Taxonomy
Topics: Neural Networks and Applications · Model Reduction and Neural Networks
