A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima

Zeke Xie, Issei Sato, and Masashi Sugiyama

arXiv:2002.03495 · cs.LG · January 18, 2021 · 26 citations

TL;DR

This paper introduces a density diffusion theory that explains why stochastic gradient descent (SGD) preferentially finds flat minima in deep learning. Its theoretical and empirical evidence indicates that SGD favors flat minima exponentially more than sharp minima, because the covariance of the stochastic gradient noise depends on the Hessian.

Contribution

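The paper develops a density diffusion theory (DDT) that quantitatively relates minima selection to minima sharpness and training hyperparameters, showing how the Hessian-dependent structure of SGD's noise biases training toward flat minima. The authors present this as the first such theoretical and empirical result.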

Findings

1. SGD exponentially favors flat minima over sharp minima due to Hessian-dependent noise.

2. Gradient Descent with white noise favors flat minima only polynomially.

3. Large-batch training requires exponentially more iterations to escape minima.
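
A rough way to see the contrast in the findings above is a one-dimensional Kramers-type escape-time estimate. The sketch below is not the paper's derivation. It assumes an overdamped 1-D model, reuses the same prefactor for both noise types, and assumes the SGD diffusion coefficient scales as lr * h / (2 * batch_size). All function names and numeric values are hypothetical and chosen only for illustration.

```python
import math


def log_escape_time_white(h_min, h_barrier, delta_loss, diffusion):
    """Log escape time for constant (white) noise in 1-D.

    Uses the classic Kramers form
    tau = 2*pi / sqrt(h_min * |h_barrier|) * exp(delta_loss / diffusion),
    where h_min is the curvature at the minimum and h_barrier the
    (negative) curvature at the barrier top.
    """
    prefactor = 2.0 * math.pi / math.sqrt(h_min * abs(h_barrier))
    return math.log(prefactor) + delta_loss / diffusion


def log_escape_time_hessian(h_min, h_barrier, delta_loss, lr, batch_size):
    """Same form, but with curvature-dependent noise.

    ASSUMPTION (illustrative only): the diffusion coefficient is
    proportional to the curvature at the minimum,
    D = lr * h_min / (2 * batch_size).
    """
    diffusion = lr * h_min / (2.0 * batch_size)
    return log_escape_time_white(h_min, h_barrier, delta_loss, diffusion)


def flat_vs_sharp_gap(log_tau, h_flat=1.0, h_sharp=10.0):
    """log(tau_flat / tau_sharp): how much longer the flat minimum traps."""
    return log_tau(h_flat) - log_tau(h_sharp)


# Hypothetical barrier curvature and barrier height, shared by all cases.
H_BARRIER, DELTA_LOSS = -1.0, 0.1

white_gap = flat_vs_sharp_gap(
    lambda h: log_escape_time_white(h, H_BARRIER, DELTA_LOSS, diffusion=0.01)
)
sgd_gap = flat_vs_sharp_gap(
    lambda h: log_escape_time_hessian(h, H_BARRIER, DELTA_LOSS, lr=0.1, batch_size=32)
)
sgd_gap_large_batch = flat_vs_sharp_gap(
    lambda h: log_escape_time_hessian(h, H_BARRIER, DELTA_LOSS, lr=0.1, batch_size=64)
)

print("white-noise log gap:      ", white_gap)
print("Hessian-noise log gap:    ", sgd_gap)
print("Hessian-noise, 2x batch:  ", sgd_gap_large_batch)
```

Under these assumptions, the white-noise gap stays at half the log of the curvature ratio no matter how high the barrier is, so the preference for the flat minimum is polynomial in sharpness. The Hessian-dependent gap instead grows linearly with batch_size / lr, which means the escape-time ratio grows exponentially.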

Abstract

Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well. However, it is mathematically unclear how deep learning can select a flat minimum among so many minima. To answer the question quantitatively, we develop a density diffusion theory (DDT) to reveal how minima selection quantitatively depends on the minima sharpness and the hyperparameters. To the best of our knowledge, we are the first to theoretically and empirically prove that, benefited from the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima, while Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima. We also reveal that either a small learning rate or large-batch training requires…

Videos

A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima (SlidesLive)

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Markov Chains and Monte Carlo Methods

Methods: Stochastic Gradient Descent