Non-Gaussianity of Stochastic Gradient Noise

Abhishek Panigrahi; Raghav Somani; Navin Goyal; Praneeth Netrapalli

arXiv:1910.09626·cs.LG·October 28, 2019·23 cites

TL;DR

This paper studies the distribution of stochastic gradient noise during neural network training and finds that, for batch sizes of 256 and above, it is best described as Gaussian at least in the early phases of training, across datasets and architectures.

Contribution

The study provides empirical evidence that Stochastic Gradient Noise (SGN) is best described as Gaussian in the early stages of training when the batch size is 256 or larger.

Findings

01. SGN is approximately Gaussian for batch sizes ≥256 in early training

02. This Gaussianity holds across different datasets and architectures

03. The result clarifies the nature of noise in SGD during initial training phases

Abstract

What enables Stochastic Gradient Descent (SGD) to achieve better generalization than Gradient Descent (GD) in Neural Network training? This question has attracted much attention. In this paper, we study the distribution of the Stochastic Gradient Noise (SGN) vectors during training. We observe that for batch sizes 256 and above, the distribution is best described as Gaussian, at least in the early phases of training. This holds across datasets, architectures, and other choices.
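As a rough illustration of the kind of measurement the abstract describes, the sketch below draws many mini-batches of size 256, forms the noise vector for each one, and checks how close a one-dimensional projection of that noise is to Gaussian. Everything specific here is an assumption made for illustration and is not the paper's experimental protocol: SGN is taken in its common form (mini-batch gradient minus full-batch gradient), the model is a toy linear regression on synthetic data rather than a neural network, and the Gaussianity check is simply sample skewness and excess kurtosis (both near 0 for a Gaussian). The function names are hypothetical.

```python
import numpy as np

def per_example_grads(w, X, y):
    # Gradient of 0.5 * (x.w - y)^2 with respect to w, one row per example.
    residual = X @ w - y
    return residual[:, None] * X

def sgn_projections(w, X, y, batch_size, n_batches, rng):
    # Assumed definition: SGN = mini-batch gradient - full-batch gradient.
    grads = per_example_grads(w, X, y)
    full_grad = grads.mean(axis=0)
    # Project each noise vector onto one fixed random unit direction.
    direction = rng.standard_normal(w.shape[0])
    direction /= np.linalg.norm(direction)
    out = np.empty(n_batches)
    for i in range(n_batches):
        idx = rng.choice(X.shape[0], size=batch_size, replace=False)
        noise = grads[idx].mean(axis=0) - full_grad
        out[i] = noise @ direction
    return out

def skew_and_excess_kurtosis(samples):
    # Standardize, then take the third and fourth moments.
    z = (samples - samples.mean()) / samples.std()
    return float(np.mean(z**3)), float(np.mean(z**4) - 3.0)

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
n, d = 10000, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)
w = np.zeros(d)  # stands in for an "early training" parameter vector

proj = sgn_projections(w, X, y, batch_size=256, n_batches=2000, rng=rng)
skew, kurt = skew_and_excess_kurtosis(proj)
print(f"skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

Rerunning with a much smaller `batch_size` is a simple way to see how the projected noise departs from Gaussian when fewer per-example gradients are averaged. A formal normality test could replace the moment check.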

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Adversarial Robustness in Machine Learning