Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

Pratik Chaudhari; Stefano Soatto

arXiv:1710.11029·cs.LG·January 17, 2018

TL;DR

This paper shows that stochastic gradient descent (SGD) performs variational inference, but for a different loss than the one used to compute the gradients. It also shows that, because gradient noise in deep networks is highly anisotropic, SGD trajectories behave like limit cycles instead of converging in the classical sense.

Contribution

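The paper proves that SGD minimizes an average potential plus an entropic regularization term, so it performs variational inference for a loss that in general differs from the original one. It further shows that the most likely SGD trajectories for deep networks are closed loops rather than paths that settle at critical points.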

Findings

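01. SGD minimizes an average potential over the posterior distribution of weights, plus an entropic regularization term.

02. The most likely SGD trajectories for deep networks resemble closed loops (limit cycles) rather than Brownian motion around critical points.

03. Gradient noise in deep networks is highly anisotropic: the covariance matrix of mini-batch gradients has a rank as low as 1% of its dimension.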
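To make the third finding concrete, the sketch below estimates the covariance of per-sample gradients and counts its significant eigenvalues. It is an illustration only, not the paper's experiment: it uses a toy linear regression whose inputs are assumed to lie in a low-dimensional subspace, and the names `gradient_noise_covariance`, `effective_rank`, `latent_dim`, and the threshold `rel_tol` are hypothetical choices made for this example.

```python
import numpy as np


def gradient_noise_covariance(X, y, w):
    """Covariance of per-sample gradients of 0.5 * (x @ w - y)**2 at weights w."""
    residuals = X @ w - y                   # shape (N,)
    per_sample = residuals[:, None] * X     # shape (N, d); row i is the gradient for sample i
    centered = per_sample - per_sample.mean(axis=0)
    return centered.T @ centered / X.shape[0]


def effective_rank(cov, rel_tol=1e-8):
    """Number of eigenvalues above rel_tol times the largest one (assumed threshold)."""
    eigvals = np.linalg.eigvalsh(cov)
    return int(np.sum(eigvals > rel_tol * eigvals.max()))


rng = np.random.default_rng(0)
n_samples, dim, latent_dim = 2000, 50, 5

# Toy assumption: inputs live in a latent_dim-dimensional subspace of R^dim,
# so every per-sample gradient lies in that subspace as well.
basis = rng.standard_normal((latent_dim, dim))
X = rng.standard_normal((n_samples, latent_dim)) @ basis
y = X @ rng.standard_normal(dim) + 0.1 * rng.standard_normal(n_samples)
w = rng.standard_normal(dim)

cov = gradient_noise_covariance(X, y, w)
rank = effective_rank(cov)
print(f"effective rank {rank} out of dimension {dim}")
```

In this toy setup the effective rank cannot exceed `latent_dim`, which is far below the 50 parameter dimensions. For a deep network one would replace the closed-form per-sample gradients with gradients obtained by backpropagation, which this sketch does not attempt.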

Abstract

Stochastic gradient descent (SGD) is widely believed to perform implicit regularization when used to train deep neural networks, but the precise manner in which this occurs has thus far been elusive. We prove that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term. This potential is however not the original loss function in general. So SGD does perform variational inference, but for a different loss than the one used to compute the gradients. Even more surprisingly, SGD does not even converge in the classical sense: we show that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points. Instead, they resemble closed loops with deterministic components. We prove that such "out-of-equilibrium" behavior is a consequence of highly non-isotropic gradient noise in SGD;…
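The objective described in the abstract can be written schematically as follows. The notation is assumed here for illustration and is not taken from the text above: ρ is a distribution over weights x, Φ is the potential (which in general differs from the training loss), H is the entropy, and β⁻¹ is a temperature-like weight on the entropic term.

```latex
\[
  \rho^{*} \;=\; \arg\min_{\rho}\;
  \underbrace{\mathbb{E}_{x \sim \rho}\bigl[\Phi(x)\bigr]}_{\text{average potential}}
  \;-\;
  \beta^{-1}\,
  \underbrace{H(\rho)}_{\text{entropy}},
  \qquad
  H(\rho) \;=\; -\int \rho(x)\,\log \rho(x)\,\mathrm{d}x .
\]
```

Read this way, the first term pulls ρ toward regions of low potential while the entropy term keeps it spread out, which is the sense in which SGD is said to perform variational inference.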

Taxonomy

Methods: Stochastic Gradient Descent