Unified Neural Network Scaling Laws and Scale-time Equivalence
Akhilan Boopathy, Ila Fiete

TL;DR
This paper introduces a unified theoretical framework for neural network scaling laws, revealing how model size, training time, and data volume interact, and demonstrating scale-time equivalence to optimize training strategies.
Contribution
It presents the first comprehensive theory linking model size, training duration, and data, including the novel concept of scale-time equivalence and a unified scaling law validated across benchmarks.
Findings
Scale-time equivalence allows performance prediction across different training regimes.
Larger models require less data for generalization, especially with longer training.
Increasing model size does not always improve performance; the effect depends on training time and data volume.
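To make the first finding concrete, here is a minimal sketch of what scale-time equivalence would imply under one strong assumption: that test error depends only on the product of parameter count and training steps. The function `toy_error`, its power-law form, and the constants `floor` and `alpha` are hypothetical and chosen for illustration; this summary does not give the paper's actual functional form, and the toy curve does not reproduce double descent.

```python
import math


def toy_error(params: float, steps: float,
              floor: float = 0.05, alpha: float = 0.5) -> float:
    """Hypothetical error curve (an assumption, not the paper's law).

    Error depends only on the product params * steps, which is the
    simplest way to encode "scaling size is equivalent to scaling
    training time proportionally".
    """
    effective = params * steps
    return floor + effective ** (-alpha)


def equivalent_steps(params_ref: float, steps_ref: float,
                     params_new: float) -> float:
    """Steps a model of size params_new needs to match the reference
    run, assuming error depends only on params * steps."""
    return steps_ref * params_ref / params_new


# A large model trained briefly versus a model 4x smaller trained 4x longer.
large = toy_error(params=1e6, steps=1e4)
small_steps = equivalent_steps(1e6, 1e4, params_new=2.5e5)
small = toy_error(params=2.5e5, steps=small_steps)
print(small_steps, math.isclose(large, small))
```

Under this assumption, a single run's error curve could be reused to predict other size and duration combinations; whether a real network follows such a curve is an empirical question.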
Abstract
As neural networks continue to grow in size but datasets might not, it is vital to understand how much performance improvement can be expected: is it more important to scale network size or data volume? Thus, neural network scaling laws, which characterize how test error varies with network size and data volume, have become increasingly important. However, existing scaling laws are often applicable only in limited regimes and often do not incorporate or predict well-known phenomena such as double descent. Here, we present a novel theoretical characterization of how three factors -- model size, training time, and data volume -- interact to determine the performance of deep neural networks. We first establish a theoretical and empirical equivalence between scaling the size of a neural network and increasing its training time proportionally. Scale-time equivalence challenges the current…
Peer Reviews
Decision: Submitted to ICLR 2025
1. The idea of functional equivalence between models at different scales is potentially very attractive for modern practices of neural network training, where the (training) compute-optimal trade-offs have to be balanced with the potential benefits of having a smaller model at inference time.
2. The idea of formalizing the relationship between training time and scale is also interesting from a theoretical perspective. In gradient descent, training time has been linked to various forms of expli
I think that despite the captivating and well-posed questions, the paper fails to deliver on several aspects. Namely: 1. The authors claim to theoretically demonstrate that scaling the size of a neural network is equivalent to increasing its training time. The connection between the model studied here and neural networks is unclear/misleading, and some design choices are not thoroughly explained. The theoretical model has the unusual structure of equation 2, which has no connection to theoretic
- An interesting theoretical take on scaling and double descent.
- Mostly well written, clear figures.
- The claims made in the paper seem too strong. The abstract mentions theoretical characterization of deep neural networks multiple times but the paper only theoretically analyzes simple proxy models. It also mentions large scale neural networks but all experiments in the paper are performed on small MNIST / SVHN / CIFAR networks. The same goes for the claims in the contribution list in the introduction.
- The paper does not sufficiently discuss its limitations (e.g. in terms of the weaknesses m
The paper studies an interesting phenomenon, relevant to practitioners and in line with the current trend in literature of scaling laws: can we train smaller models on more data without sacrificing performance? The authors also test their claims empirically on multiple datasets and architectures.
To begin with, in Section 3.1, the authors introduce a random subspace model as a toy setting to study their scaling argument. Could the authors explain why this model is a good approximation for what happens in the case of a neural network, as well as provide citations to prior works using a similar model (i.e. in line 127)? For example, where does the input data get used in this model? What is the loss used to train the model? It would help for the authors to expand with more details the setup they
Taxonomy
Topics: Neural Networks and Applications
