Non-Gaussianity of Stochastic Gradient Noise
Abhishek Panigrahi, Raghav Somani, Navin Goyal, Praneeth Netrapalli

TL;DR
This paper investigates the distribution of stochastic gradient noise (SGN) in neural network training and finds that, for batch sizes of 256 and above, it is approximately Gaussian in the early phases of training across a range of datasets and architectures.
Contribution
The study provides empirical evidence that stochastic gradient noise (SGN) is approximately Gaussian in the early stages of training when the batch size is 256 or larger, which bears on how the noise in SGD dynamics should be modeled.
Findings
SGN is approximately Gaussian for batch sizes ≥256 in early training
This Gaussianity holds across different datasets and architectures
The result clarifies the nature of noise in SGD during initial training phases
Abstract
What enables Stochastic Gradient Descent (SGD) to achieve better generalization than Gradient Descent (GD) in Neural Network training? This question has attracted much attention. In this paper, we study the distribution of the Stochastic Gradient Noise (SGN) vectors during training. We observe that for batch sizes 256 and above, the distribution is best described as Gaussian, at least in the early phases of training. This holds across datasets, architectures, and other choices.
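To make the object of study concrete, here is a minimal sketch of how SGN samples could be collected and checked for Gaussianity. It is not the authors' procedure: this summary does not say which statistical test the paper uses, so the sketch swaps in a simple excess-kurtosis check, and it uses a toy linear regression instead of a neural network. The noise is taken to be the mini-batch gradient minus the full-batch gradient, projected onto a fixed direction; the names `sgn_projections` and `excess_kurtosis` and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def sgn_projections(X, y, w, direction, batch_size, n_batches, rng):
    """Sample 1-D projections of the stochastic gradient noise.

    Toy setting (an assumption, not the paper's setup): squared loss
    0.5 * (x @ w - y)**2 for a linear model. The noise for one mini-batch is
    its mean gradient minus the full-batch gradient.
    """
    n = X.shape[0]
    residual = X @ w - y                        # shape (n,)
    per_example_grads = residual[:, None] * X   # shape (n, d)
    full_grad = per_example_grads.mean(axis=0)
    proj = np.empty(n_batches)
    for i in range(n_batches):
        idx = rng.choice(n, size=batch_size, replace=False)
        noise = per_example_grads[idx].mean(axis=0) - full_grad
        proj[i] = noise @ direction
    return proj


def excess_kurtosis(x):
    """Excess kurtosis; close to 0 for Gaussian samples, large for heavy tails."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)


rng = np.random.default_rng(0)
n, d = 5000, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)
w0 = np.zeros(d)                                # weights at "initialization"

direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)          # fixed unit-norm projection direction

k_small = excess_kurtosis(sgn_projections(X, y, w0, direction, 1, 2000, rng))
k_large = excess_kurtosis(sgn_projections(X, y, w0, direction, 256, 2000, rng))
print(f"excess kurtosis, batch size 1:   {k_small:.2f}")
print(f"excess kurtosis, batch size 256: {k_large:.2f}")
```

In this toy problem the single-example noise is visibly heavy-tailed, while averaging over 256 examples brings the excess kurtosis close to zero, which is the qualitative pattern the abstract reports for large batches. A one-dimensional projection and a single moment are only a rough diagnostic; they do not establish that the full noise vector is Gaussian.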
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Adversarial Robustness in Machine Learning
