The effective noise of Stochastic Gradient Descent
Francesca Mignacco, Pierfrancesco Urbani

TL;DR
This paper characterizes the mini-batch noise of SGD and persistent SGD in a prototypical neural network model. Using dynamical mean-field theory and replica analysis, it quantifies how the noise magnitude varies with the problem parameters and how it affects the width of the decision boundary.
Contribution
It introduces a theoretical framework to quantify SGD noise via effective temperature and replica analysis, linking noise levels to decision boundary properties.
Findings
An effective temperature quantifies the SGD noise in the under-parametrized regime, where the final training error is positive.
Noise measures from fluctuation-dissipation and replica methods are consistent.
Higher noise levels lead to wider decision boundaries.
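For orientation, the standard stationary form of the fluctuation-dissipation theorem, from which an effective temperature can be read off, is sketched below. The symbols $C$, $R$, $\tau$ and $T_{\mathrm{eff}}$ are generic notation assumed here for illustration (with $k_B = 1$); the paper's own definitions and normalization may differ.

```latex
% C(\tau): stationary two-time correlation of the weights
% R(\tau): linear response to a small field applied a time \tau earlier
% Equilibrium FDT at temperature T, valid for \tau > 0:
R(\tau) = -\frac{1}{T}\,\frac{\mathrm{d}C(\tau)}{\mathrm{d}\tau}
% Out of equilibrium, the same ratio defines an effective temperature:
T_{\mathrm{eff}} = -\,\frac{\mathrm{d}C(\tau)/\mathrm{d}\tau}{R(\tau)}
```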
Abstract
Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning technology. At each step of the training phase, a mini-batch of samples is drawn from the training dataset and the weights of the neural network are adjusted according to the performance on this specific subset of examples. The mini-batch sampling procedure introduces a stochastic dynamics to the gradient descent, with a non-trivial state-dependent noise. We characterize the stochasticity of SGD and a recently-introduced variant, persistent SGD, in a prototypical neural network model. In the under-parametrized regime, where the final training error is positive, the SGD dynamics reaches a stationary state and we define an effective temperature from the fluctuation-dissipation theorem, computed from dynamical mean-field theory. We use the effective temperature to quantify the magnitude of the SGD noise as…
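The state-dependent nature of mini-batch noise described above can be illustrated with a minimal NumPy sketch. It is not the paper's model or method: the linear regression task, the squared loss, and the helper names `full_gradient`, `minibatch_gradient` and `noise_magnitude` are all assumptions chosen for illustration. The sketch treats the SGD noise as the difference between a mini-batch gradient and the full-batch gradient, and measures its mean squared norm at two different weight vectors.

```python
import numpy as np

def full_gradient(w, X, y):
    """Gradient of the loss 0.5 * mean((X w - y)^2) over the whole dataset."""
    return X.T @ (X @ w - y) / len(y)

def minibatch_gradient(w, X, y, batch_size, rng):
    """Same gradient, estimated on a mini-batch drawn without replacement."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    return X[idx].T @ (X[idx] @ w - y[idx]) / batch_size

def noise_magnitude(w, X, y, batch_size, rng, n_draws=500):
    """Mean squared norm of (mini-batch gradient - full gradient) at fixed w."""
    g_full = full_gradient(w, X, y)
    sq_norms = [
        np.sum((minibatch_gradient(w, X, y, batch_size, rng) - g_full) ** 2)
        for _ in range(n_draws)
    ]
    return float(np.mean(sq_norms))

# Illustrative setup (assumed): more samples than parameters and noisy labels,
# so the training error stays positive even at the best-fit weights.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 5
X = rng.normal(size=(n_samples, n_features))
y = X @ np.ones(n_features) + 0.5 * rng.normal(size=n_samples)

w_near = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares minimizer
w_far = w_near + 5.0                           # a point far from the minimizer

print("noise near the minimizer:", noise_magnitude(w_near, X, y, 20, rng))
print("noise far from it:       ", noise_magnitude(w_far, X, y, 20, rng))
```

Because each per-sample gradient scales with that sample's residual, the noise magnitude depends on where the weights currently sit, which is what "state-dependent" means in this toy setting. With `batch_size` equal to the dataset size the noise vanishes and plain gradient descent is recovered.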
Taxonomy
Methods: Stochastic Gradient Descent
