Universal scaling laws in the gradient descent training of neural networks

Maksim Velikanov; Dmitry Yarotsky

arXiv:2105.00507·cs.LG·May 4, 2021·5 cites

TL;DR

This paper derives an explicit large-time asymptotic for the loss of neural networks trained by gradient descent: a power law whose exponent follows from spectral analysis of the linearized training dynamics, without assuming a specific form of data distribution.

Contribution

It introduces an asymptotic characterization of training dynamics that links the rate of loss decay to the data dimension, the smoothness of the activation function, and the class of function being approximated, without requiring a specific form of data distribution such as Gaussian.

Findings

01

Loss behaves as a power law $L(t) \sim t^{-\xi}$ at large times

02

Exponent $\xi$ is expressed only through the data dimension, the smoothness of the activation function, and the class of function being approximated

03

Results are universal, not limited to specific data distributions
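
One way to read the first finding, as a sketch: assume the leading term of the loss has the form $C\,t^{-\xi}$, where the prefactor $C > 0$ is an illustrative symbol not given on this page. Taking logarithms then gives

$$\log L(t) = \log C - \xi \log t + o(1), \qquad \text{so} \qquad \xi = -\lim_{t \to \infty} \frac{\log L(t)}{\log t}.$$

Under that assumption, the loss curve approaches a straight line of slope $-\xi$ on a log-log plot at large training times.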

Abstract

Current theoretical results on optimization trajectories of neural networks trained by gradient descent typically have the form of rigorous but potentially loose bounds on the loss values. In the present work we take a different approach and show that the learning trajectory can be characterized by an explicit asymptotic at large training times. Specifically, the leading term in the asymptotic expansion of the loss behaves as a power law $L(t) \sim t^{-\xi}$ with exponent $\xi$ expressed only through the data dimension, the smoothness of the activation function, and the class of function being approximated. Our results are based on spectral analysis of the integral operator representing the linearized evolution of a large network trained on the expected loss. Importantly, the techniques we employ do not require a specific form of data distribution, for example Gaussian, thus making our…
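
The link between a spectrum and a power-law loss can be illustrated with a toy model. The sketch below is not the paper's construction: it assumes gradient flow on a linearized model whose operator has eigenvalues `lam_k = k**-nu` and whose target has squared coefficients `c_k = k**-(kappa + 1)`. The names `nu`, `kappa`, `spectral_loss`, and `fit_exponent` are hypothetical choices made for this illustration. Under these assumptions each mode decays independently, and the summed loss behaves like $t^{-\kappa/\nu}$ at large $t$.

```python
import numpy as np


def spectral_loss(t, nu, kappa, n_modes=100_000):
    """Toy loss L(t) = sum_k c_k * exp(-2 * lam_k * t).

    Assumed (illustrative) spectrum: lam_k = k**-nu, c_k = k**-(kappa + 1).
    """
    k = np.arange(1, n_modes + 1, dtype=np.float64)
    lam = k ** (-nu)             # eigenvalues of the linearized evolution
    c = k ** (-(kappa + 1.0))    # squared target coefficients per mode
    t = np.asarray(t, dtype=np.float64)
    # Each mode decays independently; sum the contributions for every time.
    return np.exp(-2.0 * np.outer(t, lam)) @ c


def fit_exponent(t, loss):
    """Least-squares slope of log(loss) against log(t); returns the xi estimate."""
    slope, _intercept = np.polyfit(np.log(t), np.log(loss), 1)
    return -slope


nu, kappa = 2.0, 1.0
t = np.logspace(2, 4, 20)        # times where the active modes lie well inside the truncation
loss = spectral_loss(t, nu, kappa)
xi_hat = fit_exponent(t, loss)
print(f"fitted xi = {xi_hat:.3f}, kappa / nu = {kappa / nu:.3f}")
```

The time window matters in this sketch: at very small `t` no mode has started to decay, and at very large `t` the truncation at `n_modes` distorts the tail, so the fit is restricted to an intermediate range.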

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Face and Expression Recognition