A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima
Zeke Xie, Issei Sato, and Masashi Sugiyama

TL;DR
This paper introduces a density diffusion theory that explains why stochastic gradient descent (SGD) tends to settle in flat minima in deep learning. It gives theoretical and empirical evidence that, because of the Hessian-dependent covariance of its gradient noise, SGD favors flat minima exponentially more than sharp minima.
Contribution
It develops a density diffusion theory (DDT) that quantifies how minima selection depends on minima sharpness and on the hyperparameters, showing how the structure of SGD's gradient noise biases training toward flat minima.
Findings
SGD exponentially favors flat minima over sharp minima due to Hessian-dependent noise.
Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima.
Large-batch training requires exponentially more iterations to escape minima.
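The contrast between polynomial and exponential dependence on sharpness can be illustrated with a toy calculation. The sketch below is not the paper's theorem: it uses the classical one-dimensional Kramers escape-time formula for overdamped Langevin dynamics, and then, as a loudly labeled assumption, replaces the constant noise strength with one proportional to the curvature at the minimum, `lr * h_min / (2 * batch_size)`. All function and parameter names are hypothetical.

```python
import math


def kramers_escape_time(h_min, h_saddle, barrier, diffusion):
    """Classical 1D Kramers mean escape time for
    dx = -L'(x) dt + sqrt(2 * diffusion) dW, valid when barrier >> diffusion.

    h_min:    curvature (second derivative) at the minimum, > 0
    h_saddle: curvature at the barrier top, < 0
    barrier:  loss difference between barrier top and minimum
    """
    prefactor = 2.0 * math.pi / math.sqrt(h_min * abs(h_saddle))
    return prefactor * math.exp(barrier / diffusion)


def curvature_noise_escape_time(h_min, h_saddle, barrier, lr, batch_size):
    """Toy SGD-like variant (an assumption, not the paper's formula):
    the noise strength scales with the curvature at the minimum."""
    diffusion = lr * h_min / (2.0 * batch_size)
    return kramers_escape_time(h_min, h_saddle, barrier, diffusion)


if __name__ == "__main__":
    # White noise: sharpness only enters the prefactor (polynomial effect).
    flat = kramers_escape_time(1.0, -1.0, 0.1, 0.05)
    sharp = kramers_escape_time(4.0, -1.0, 0.1, 0.05)
    print("white noise, flat/sharp escape-time ratio:", flat / sharp)

    # Curvature-scaled noise: sharpness enters the exponent.
    flat = curvature_noise_escape_time(1.0, -1.0, 0.1, lr=0.1, batch_size=8)
    sharp = curvature_noise_escape_time(4.0, -1.0, 0.1, lr=0.1, batch_size=8)
    print("curvature noise, flat/sharp escape-time ratio:", flat / sharp)
```

In this toy model the batch size appears linearly inside the exponent, so doubling it squares the exponential factor of the escape time, which is the qualitative behavior the large-batch finding describes. Times here are in the units of the continuous-time approximation, not iteration counts.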
Abstract
Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well. However, it is mathematically unclear how deep learning can select a flat minimum among so many minima. To answer the question quantitatively, we develop a density diffusion theory (DDT) to reveal how minima selection quantitatively depends on the minima sharpness and the hyperparameters. To the best of our knowledge, we are the first to theoretically and empirically prove that, benefited from the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima, while Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima. We also reveal that either a small learning rate or large-batch training requires…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Markov Chains and Monte Carlo Methods
Methods: Stochastic Gradient Descent
