On the Noisy Gradient Descent that Generalizes as SGD
Jingfeng Wu, Wenqing Hu, Haoyi Xiong, Jun Huan, Vladimir Braverman, Zhanxing Zhu

TL;DR
This paper investigates the role of gradient noise in SGD's generalization, showing that various noise distributions can effectively regularize training, and proposes a flexible noisy gradient descent algorithm that improves generalization, even for large batch sizes.
Contribution
The paper shows that the class of the noise distribution is not crucial for regularization, characterizes SGD noise as the product of the gradient matrix and a sampling noise, and proposes a flexible noisy gradient descent algorithm that improves generalization in deep learning.
Findings
Noise from classes other than that of SGD noise can also regularize gradient descent effectively.
SGD noise is the product of the gradient matrix and a sampling noise that arises from mini-batch sampling.
The proposed algorithm improves generalization, including in large-batch training.
Abstract
The gradient noise of SGD is considered to play a central role in the observed strong generalization abilities of deep learning. While past studies confirm that the magnitude and the covariance structure of gradient noise are critical for regularization, it remains unclear whether or not the class of noise distributions is important. In this work we provide negative results by showing that noises in classes different from the SGD noise can also effectively regularize gradient descent. Our finding is based on a novel observation on the structure of the SGD noise: it is the multiplication of the gradient matrix and a sampling noise that arises from the mini-batch sampling procedure. Moreover, the sampling noises unify two kinds of gradient regularizing noises that belong to the Gaussian class: the one using (scaled) Fisher as covariance and the one using the gradient covariance of SGD as…
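The structural observation in the abstract can be checked numerically on a small example. The sketch below is not the authors' code: the toy least-squares problem, the variable names (`G`, `w`, `sampling_noise`), and the sign convention (mini-batch gradient minus full gradient) are assumptions chosen for illustration. It writes a mini-batch draw as a random weight vector over the samples and confirms that the resulting gradient noise equals the gradient matrix times the centered weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem (hypothetical data, for illustration only).
n, d, b = 8, 3, 4              # samples, parameters, mini-batch size
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
theta = rng.normal(size=d)

# Gradient matrix G: column i is the gradient of the per-sample loss
# 0.5 * (x_i @ theta - y_i)**2 with respect to theta.
residual = X @ theta - y       # shape (n,)
G = (X * residual[:, None]).T  # shape (d, n)

# Full-batch gradient: G applied to the uniform weight vector (1/n, ..., 1/n).
full_grad = G @ np.full(n, 1.0 / n)

# Mini-batch sampling without replacement, written as a random weight vector:
# weight 1/b on the sampled indices and 0 elsewhere.
batch = rng.choice(n, size=b, replace=False)
w = np.zeros(n)
w[batch] = 1.0 / b

minibatch_grad = G[:, batch].mean(axis=1)

# Sampling noise: the weight vector centered at its uniform mean.
sampling_noise = w - 1.0 / n

# Gradient noise of this SGD step (assumed sign convention).
sgd_noise = minibatch_grad - full_grad

# The SGD noise is the gradient matrix multiplied by the sampling noise.
assert np.allclose(sgd_noise, G @ sampling_noise)
```

Under this view, replacing `sampling_noise` with a draw from another zero-mean distribution over the n sample weights gives gradient noise of a different class while reusing the same gradient matrix, which is the kind of substitution the abstract describes.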
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications
Methods: Stochastic Gradient Descent
