Universal scaling laws in the gradient descent training of neural networks
Maksim Velikanov, Dmitry Yarotsky

TL;DR
This paper derives universal power-law scaling laws for the loss of neural networks trained by gradient descent. The derivation rests on spectral analysis of the integral operator describing the linearized training dynamics and does not assume a specific data distribution.
Contribution
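It gives an explicit large-time asymptotic of the training loss, with a power-law exponent expressed only through the data dimension, the smoothness of the activation function, and the class of function being approximated. No specific form of the data distribution (for example Gaussian) is required.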
Findings
Loss behaves as a power law $L(t) \sim t^{-\xi}$ at large training times
Exponent $\xi$ is determined by the data dimension, the smoothness of the activation function, and the class of function being approximated
Results are universal, not limited to specific data distributions
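As a rough illustration of how a power law in time can arise from a spectrum, the sketch below evaluates the loss of a linearized gradient flow, $L(t) = \frac{1}{2}\sum_k c_k^2 e^{-2\lambda_k t}$, for a toy spectrum. The choices $\lambda_k = k^{-\nu}$ and $c_k^2 = k^{-(\kappa+1)}$, the names `nu`, `kappa`, `toy_loss` and `fit_exponent`, and all numerical values are assumptions made for this example; they are not taken from the paper. Under these assumptions, replacing the sum by an integral gives $L(t) \sim t^{-\kappa/\nu}$, and the code estimates that exponent from a log-log fit.

```python
import numpy as np

def toy_loss(t, nu=2.0, kappa=1.0, n_modes=200_000):
    """Loss of linearized gradient flow for an assumed power-law spectrum.

    Assumptions (illustrative only): eigenvalues lambda_k = k**(-nu) and
    squared target coefficients c_k**2 = k**(-(kappa + 1)).
    """
    k = np.arange(1, n_modes + 1, dtype=np.float64)
    lam = k ** (-nu)              # assumed eigenvalues of the linearized operator
    c2 = k ** (-(kappa + 1.0))    # assumed squared coefficients of the target
    # Each spectral mode decays independently as exp(-2 * lambda_k * t).
    return 0.5 * np.sum(c2 * np.exp(-2.0 * lam * t))

def fit_exponent(ts, losses):
    """Estimate xi in L(t) ~ t**(-xi) by a least-squares fit in log-log scale."""
    slope, _ = np.polyfit(np.log(ts), np.log(losses), 1)
    return -slope

# Times large enough for the asymptotic regime, small enough that the
# truncation to n_modes modes does not matter.
ts = np.logspace(2, 4, 20)
losses = np.array([toy_loss(t) for t in ts])
xi_hat = fit_exponent(ts, losses)

print(f"fitted exponent: {xi_hat:.3f}, toy prediction kappa/nu: {1.0 / 2.0:.3f}")
```

The time window is chosen so that the slowest retained mode (with eigenvalue `n_modes**(-nu)`) is still far from relaxed; at much larger times the finite truncation would bend the curve away from the power law.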
Abstract
Current theoretical results on optimization trajectories of neural networks trained by gradient descent typically have the form of rigorous but potentially loose bounds on the loss values. In the present work we take a different approach and show that the learning trajectory can be characterized by an explicit asymptotic at large training times. Specifically, the leading term in the asymptotic expansion of the loss behaves as a power law with exponent expressed only through the data dimension, the smoothness of the activation function, and the class of function being approximated. Our results are based on spectral analysis of the integral operator representing the linearized evolution of a large network trained on the expected loss. Importantly, the techniques we employ do not require specific form of a data distribution, for example Gaussian, thus making our…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Face and Expression Recognition
