Inherent Noise in Gradient Based Methods

Arushi Gupta

arXiv:2005.12743·cs.LG·May 27, 2020

TL;DR

This paper investigates how the inherent noise in gradient-based methods like GD and SGD influences model robustness and generalization, especially in larger neural networks, by analyzing the effects of stale parameter updates.

Contribution

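The paper shows that because GD and SGD update all parameters simultaneously, each partial derivative is computed at a stale point, which introduces noise into the optimization. This noise penalizes models that are sensitive to weight perturbations, with effects more pronounced in larger models and during batch updates.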

Findings

01 Noise from stale updates penalizes sensitive models

02 Larger models experience higher penalties

03 Noise effects are most pronounced during batch updates

Abstract

Previous work has examined the ability of larger capacity neural networks to generalize better than smaller ones, even without explicit regularizers, by analyzing gradient-based algorithms such as GD and SGD. The presence of noise and its effect on robustness to parameter perturbations has been linked to generalization. We examine a property of GD and SGD, namely that instead of iterating through all scalar weights in the network and updating them one by one, GD (and SGD) updates all the parameters at the same time. As a result, each parameter $w^{i}$ calculates its partial derivative at the stale parameter $w_{t}$, but then suffers loss $\hat{L}(w_{t+1})$. We show that this causes noise to be introduced into the optimization. We find that this noise penalizes models that are sensitive to perturbations in the weights. We find that penalties are most pronounced for batches…
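The distinction the abstract draws between updating all parameters at once and updating them one by one can be illustrated on a toy problem. The sketch below is not from the paper: the two-parameter quadratic loss, the coupling constant `c`, and the function names are all assumptions chosen for illustration. It only shows that, when parameters interact, a simultaneous step evaluates every partial derivative at the stale point $w_{t}$ and therefore lands somewhere different from a one-by-one sweep.

```python
# Illustrative sketch only; the loss and all names here are hypothetical,
# not taken from the paper.

def partial(w, i, c):
    # Partial derivative of the toy loss
    # L(w) = 0.5 * (w0^2 + w1^2) + c * w0 * w1
    # with respect to coordinate i (two parameters only).
    return w[i] + c * w[1 - i]

def simultaneous_step(w, lr, c):
    # GD-style update: every partial derivative is evaluated
    # at the same stale point w_t before any weight changes.
    return [w[i] - lr * partial(w, i, c) for i in range(2)]

def sequential_step(w, lr, c):
    # One-by-one update: each partial derivative is evaluated
    # at the most recent point, including earlier coordinate updates.
    w = list(w)
    for i in range(2):
        w[i] -= lr * partial(w, i, c)
    return w

w_t = [1.0, 1.0]
lr = 0.5

# Coupled parameters: the two update rules disagree on the second weight.
w_gd = simultaneous_step(w_t, lr, c=0.9)
w_seq = sequential_step(w_t, lr, c=0.9)
stale_gap = [a - b for a, b in zip(w_gd, w_seq)]

# Uncoupled parameters (c = 0): staleness has no effect.
w_gd_uncoupled = simultaneous_step(w_t, lr, c=0.0)
w_seq_uncoupled = sequential_step(w_t, lr, c=0.0)

print("simultaneous:", w_gd)
print("sequential:  ", w_seq)
print("gap:         ", stale_gap)
```

The first coordinate matches under both rules because nothing has changed yet when it is updated; the gap appears only in the second coordinate, whose partial derivative depends on the first. This toy says nothing about which rule generalizes better; it only makes the stale evaluation point concrete.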

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Model Reduction and Neural Networks

Methods: Stochastic Gradient Descent