Unified Neural Network Scaling Laws and Scale-time Equivalence

Akhilan Boopathy; Ila Fiete

arXiv:2409.05782·cs.LG·September 10, 2024

TL;DR

This paper introduces a unified theoretical framework for neural network scaling laws that describes how model size, training time, and data volume interact. It argues for a scale-time equivalence: scaling the size of a network has an effect comparable to increasing its training time proportionally, which can inform training strategy.

Contribution

The paper presents the first comprehensive theory linking model size, training duration, and data volume, including the novel concept of scale-time equivalence and a unified scaling law tested empirically on multiple datasets and architectures.

Findings

1. Scale-time equivalence allows performance prediction across different training regimes.
2. Larger models require less data for generalization, especially with longer training.
3. Increasing model size does not always improve performance; the effect depends on training time and data volume.

Abstract

As neural networks continue to grow in size but datasets might not, it is vital to understand how much performance improvement can be expected: is it more important to scale network size or data volume? Thus, neural network scaling laws, which characterize how test error varies with network size and data volume, have become increasingly important. However, existing scaling laws are often applicable only in limited regimes and often do not incorporate or predict well-known phenomena such as double descent. Here, we present a novel theoretical characterization of how three factors -- model size, training time, and data volume -- interact to determine the performance of deep neural networks. We first establish a theoretical and empirical equivalence between scaling the size of a neural network and increasing its training time proportionally. Scale-time equivalence challenges the current…
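To make the idea of trading size against training time concrete, here is a minimal toy sketch. It is an illustration under stated assumptions and not the paper's model: the "model size" is the dimension d of a random subspace, each "training step" minimizes a quadratic loss exactly within a freshly sampled d-dimensional subspace of a D-dimensional parameter space, and the function name residual_after is hypothetical. In this toy, each step removes on average a fraction d/D of the remaining squared error, so the error after t steps is roughly (1 - d/D)^t and depends mainly on the product d·t.

```python
import numpy as np


def residual_after(d, t, D=1000, seed=0):
    """Fraction of squared error left after t steps in random d-dim subspaces.

    Toy assumption (not taken from the paper): the loss is 0.5 * ||r||^2 for a
    residual vector r in R^D, and each step minimizes it exactly within a
    freshly sampled random d-dimensional subspace.
    """
    rng = np.random.default_rng(seed)
    r = rng.standard_normal(D)
    r /= np.linalg.norm(r)  # start with unit squared error
    for _ in range(t):
        # Orthonormal basis (D x d) of a random d-dimensional subspace.
        Q, _R = np.linalg.qr(rng.standard_normal((D, d)))
        # Exact minimization inside the subspace removes the projection of r.
        r = r - Q @ (Q.T @ r)
    return float(r @ r)


# Configurations with the same product d * t leave a similar residual error,
# close to (1 - d / D) ** t.
for d, t in [(10, 40), (20, 20), (40, 10)]:
    print(d, t, residual_after(d, t), (1 - d / 1000) ** t)
```

Doubling d while halving t leaves the printed residual nearly unchanged, which is the kind of size-for-time exchange the abstract describes. Whether real networks follow this pattern is the empirical question the paper addresses.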

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 3·Confidence 4

Strengths

1. The idea of functional equivalence between models at different scales is potentially very attractive for modern practices of neural network training, where the (training) compute-optimal trade-offs have to be balanced with the potential benefits of having a smaller model at inference time.
2. The idea of formalizing the relationship between training time and scale is also interesting from a theoretical perspective. In gradient descent, training time has been linked to various forms of expli

Weaknesses

I think that despite the captivating and well-posed questions, the paper fails to deliver on several aspects. Namely:
1. The authors claim to theoretically demonstrate that scaling the size of a neural network is equivalent to increasing its training time. The connection between the model studied here and neural networks is unclear/misleading, and some design choices are not thoroughly explained. The theoretical model has the unusual structure of equation 2, which has no connection to theoretic

Reviewer 02·Rating 6·Confidence 3

Strengths

- An interesting theoretical take on scaling and double descent.
- Mostly well written, clear figures.

Weaknesses

- The claims made in the paper seem too strong. The abstract mentions theoretical characterization of deep neural networks multiple times but the paper only theoretically analyzes simple proxy models. It also mentions large scale neural networks but all experiments in the paper are performed on small MNIST / SVHN / CIFAR networks. The same goes for the claims in the contribution list in the introduction.
- The paper does not sufficiently discuss its limitations (e.g. in terms of the weaknesses m

Reviewer 03·Rating 3·Confidence 3

Strengths

The paper studies an interesting phenomenon, relevant to practitioners and in line with the current trend in literature of scaling laws: can we train smaller models on more data without sacrificing performance? The authors also test their claims empirically on multiple datasets and architectures.

Weaknesses

To begin with, in Section 3.1, the authors introduce a random subspace model as a toy setting to study their scaling argument. Could the authors explain why this model is a good approximation for what happens in the case of a neural network, as well as provide citations to prior works using a similar model (i.e. in line 127)? For example, where does the input data get used in this model? What is the loss used to train the model? It would help for the authors to expand with more details the setup they

Taxonomy

Topics: Neural Networks and Applications