The effective noise of Stochastic Gradient Descent

Francesca Mignacco; Pierfrancesco Urbani

arXiv:2112.10852 · cond-mat.dis-nn · September 7, 2022

TL;DR

This paper characterizes the noise of SGD and persistent SGD in a prototypical neural network model. Using dynamical mean-field theory and replica methods, it quantifies how the noise magnitude varies with parameters and how it affects the width of the decision boundary.

Contribution

It introduces a theoretical framework to quantify SGD noise via effective temperature and replica analysis, linking noise levels to decision boundary properties.

Findings

01 Effective temperature quantifies SGD noise in the under-parametrized regime.

02 Noise measures from fluctuation-dissipation and replica methods are consistent.

03 Higher noise levels lead to wider decision boundaries.

Abstract

Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning technology. At each step of the training phase, a mini batch of samples is drawn from the training dataset and the weights of the neural network are adjusted according to the performance on this specific subset of examples. The mini-batch sampling procedure introduces a stochastic dynamics to the gradient descent, with a non-trivial state-dependent noise. We characterize the stochasticity of SGD and a recently introduced variant, persistent SGD, in a prototypical neural network model. In the under-parametrized regime, where the final training error is positive, the SGD dynamics reaches a stationary state and we define an effective temperature from the fluctuation-dissipation theorem, computed from dynamical mean-field theory. We use the effective temperature to quantify the magnitude of the SGD noise as…
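The mini-batch update described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy linear model with squared loss, the random labels (chosen so that the training loss stays positive when there are more samples than weights), and all function names (`make_dataset`, `batch_gradient`, `sgd`, `gradient_noise`). It covers plain mini-batch SGD only, not the persistent variant. The helper `gradient_noise` estimates how far a mini-batch gradient deviates from the full-batch gradient at a given weight vector, which is one simple way to see that the noise depends on the current state.

```python
import random

def make_dataset(n_samples, dim, seed=0):
    """Toy data (an assumption): Gaussian inputs with random +/-1 labels."""
    rng = random.Random(seed)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]
    ys = [rng.choice([-1.0, 1.0]) for _ in range(n_samples)]
    return xs, ys

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def batch_gradient(w, xs, ys, indices):
    """Gradient of the mean squared loss 0.5 * (w.x - y)^2 over `indices`."""
    grad = [0.0] * len(w)
    for i in indices:
        err = dot(w, xs[i]) - ys[i]
        for j, xj in enumerate(xs[i]):
            grad[j] += err * xj / len(indices)
    return grad

def training_loss(w, xs, ys):
    """Mean squared loss over the whole training set."""
    return sum(0.5 * (dot(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def sgd(xs, ys, batch_size, lr, n_steps, seed=1):
    """Plain mini-batch SGD starting from w = 0."""
    rng = random.Random(seed)
    n = len(xs)
    w = [0.0] * len(xs[0])
    for _ in range(n_steps):
        batch = rng.sample(range(n), batch_size)  # fresh mini-batch each step
        g = batch_gradient(w, xs, ys, batch)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w

def gradient_noise(w, xs, ys, batch_size, n_draws=200, seed=2):
    """Mean squared deviation of the mini-batch gradient from the full one."""
    rng = random.Random(seed)
    n = len(xs)
    full = batch_gradient(w, xs, ys, range(n))
    total = 0.0
    for _ in range(n_draws):
        g = batch_gradient(w, xs, ys, rng.sample(range(n), batch_size))
        total += sum((a - b) ** 2 for a, b in zip(g, full))
    return total / n_draws

if __name__ == "__main__":
    xs, ys = make_dataset(n_samples=200, dim=5)
    w = sgd(xs, ys, batch_size=10, lr=0.05, n_steps=500)
    print("final training loss:", training_loss(w, xs, ys))
    print("gradient noise at w = 0:", gradient_noise([0.0] * 5, xs, ys, 10))
    print("gradient noise at final w:", gradient_noise(w, xs, ys, 10))
```

Setting `batch_size` equal to the number of samples recovers deterministic full-batch gradient descent, for which `gradient_noise` vanishes up to floating-point rounding. Comparing that run with a small-batch run is a quick way to isolate the contribution of the sampling noise in this toy setting.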

Taxonomy

Methods: Stochastic Gradient Descent