Inherent Noise in Gradient Based Methods
Arushi Gupta

TL;DR
This paper investigates how the inherent noise in gradient-based methods like GD and SGD influences model robustness and generalization, especially in larger neural networks, by analyzing the effects of stale parameter updates.
Contribution
It shows that updating all parameters simultaneously in GD and SGD introduces noise that penalizes models sensitive to weight perturbations, with effects more pronounced in larger models and during batch updates.
Findings
Noise from stale updates penalizes sensitive models
Larger models experience higher penalties
Noise effects are most pronounced during batch updates
Abstract
Previous work has examined the ability of larger capacity neural networks to generalize better than smaller ones, even without explicit regularizers, by analyzing gradient-based algorithms such as GD and SGD. The presence of noise and its effect on robustness to parameter perturbations has been linked to generalization. We examine a property of GD and SGD, namely that instead of iterating through all scalar weights in the network and updating them one by one, GD (and SGD) updates all the parameters at the same time. As a result, each parameter calculates its partial derivative at the stale parameter vector, but then suffers the loss at the updated parameter vector. We show that this causes noise to be introduced into the optimization. We find that this noise penalizes models that are sensitive to perturbations in the weights. We find that penalties are most pronounced for batches…
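The difference between updating all weights at once and updating them one by one can be made concrete on a toy problem. The sketch below is an illustration only and is not taken from the paper: the quadratic loss, the matrix `A`, the vector `b`, the learning rate, and the function names are all hypothetical. It compares one simultaneous GD step, where every partial derivative is evaluated at the same stale point, with one sequential sweep, where each partial derivative sees the weights already updated before it.

```python
# Toy quadratic loss L(w) = 0.5 * w^T A w - b^T w.
# A, b, the starting point, and the learning rate are hypothetical choices.
A = [[3.0, 1.0], [1.0, 2.0]]  # symmetric, so dL/dw_i = (A w)_i - b_i
b = [1.0, -1.0]

def loss(w):
    quad = sum(w[i] * A[i][j] * w[j] for i in range(2) for j in range(2))
    return 0.5 * quad - sum(b[i] * w[i] for i in range(2))

def partial(w, i):
    # Partial derivative of the loss with respect to w[i], evaluated at w.
    return sum(A[i][j] * w[j] for j in range(2)) - b[i]

def simultaneous_step(w, lr):
    # GD-style update: all partials are computed at the same (stale) point w,
    # then every coordinate moves at once.
    grads = [partial(w, i) for i in range(len(w))]
    return [w[i] - lr * grads[i] for i in range(len(w))]

def sequential_step(w, lr):
    # One-by-one update: each partial is computed at the freshest weights,
    # including coordinates already changed earlier in this sweep.
    w = list(w)
    for i in range(len(w)):
        w[i] -= lr * partial(w, i)
    return w

w0 = [1.0, 1.0]
lr = 0.1
w_sim = simultaneous_step(w0, lr)
w_seq = sequential_step(w0, lr)

# The gap between the two results is the effect of using stale partials.
gap = [w_sim[i] - w_seq[i] for i in range(2)]
print("simultaneous:", w_sim, "loss:", loss(w_sim))
print("sequential:  ", w_seq, "loss:", loss(w_seq))
print("gap:", gap)
```

In this toy setting the first coordinate is identical under both schemes, because nothing has changed yet when it is updated. The second coordinate differs because the sequential sweep sees the already-updated first weight, while the simultaneous step does not. The size of that gap depends on the off-diagonal coupling in `A`, which gives a rough intuition for why sensitivity to weight perturbations matters here.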
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Model Reduction and Neural Networks
Methods: Stochastic Gradient Descent
