The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent
Lei Wu, Weijie J. Su

TL;DR
This paper investigates how stochastic gradient descent (SGD) implicitly regularizes model complexity through dynamical stability, which leads to better generalization than gradient descent (GD). The strength of this effect depends on the learning rate.
Contribution
The paper establishes a theoretical link between sharpness-based stability conditions and generalization for SGD, contrasts this with the weaker stability condition for GD, and highlights how the learning rate sets the regularization strength.
Findings
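Stable minima of SGD generalize well in two-layer ReLU networks and diagonal linear networks, where the sharpness metrics are equivalent to certain parameter norms.
GD's stability condition constrains only the largest eigenvalue of the Hessian, which is too weak for effective regularization.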
Larger learning rates enhance SGD's regularization effect.
Abstract
In this paper, we study the implicit regularization of stochastic gradient descent (SGD) through the lens of dynamical stability (Wu et al., 2018). We start by revising existing stability analyses of SGD, showing how the Frobenius norm and trace of Hessian relate to different notions of stability. Notably, if a global minimum is linearly stable for SGD, then the trace of Hessian must be less than or equal to 2/η, where η denotes the learning rate. By contrast, for gradient descent (GD), the stability imposes a similar constraint but only on the largest eigenvalue of Hessian. We then turn to analyze the generalization properties of these stable minima, focusing specifically on two-layer ReLU networks and diagonal linear networks. Notably, we establish the equivalence between these metrics of sharpness and certain parameter norms for the two models, which allows us…
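The contrast between the two stability conditions in the abstract can be illustrated with a small numerical sketch. It assumes a quadratic loss with a fixed positive semidefinite Hessian and takes the threshold to be 2/η, the standard linear-stability bound for GD on a quadratic. The function names and the example matrix are hypothetical, and the code does not simulate SGD itself: it only compares the largest-eigenvalue condition with the stricter trace condition and checks the GD case by direct iteration.

```python
import numpy as np

def gd_is_linearly_stable(hessian, lr):
    """GD on 0.5 * x^T H x is stable iff lambda_max(H) <= 2 / lr."""
    return bool(np.linalg.eigvalsh(hessian).max() <= 2.0 / lr)

def satisfies_trace_bound(hessian, lr):
    """Trace condition tr(H) <= 2 / lr (assumed threshold, see lead-in)."""
    return bool(np.trace(hessian) <= 2.0 / lr)

def run_gd(hessian, lr, steps=200, seed=0):
    """Iterate x <- x - lr * H x from a random start; return the final norm."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(hessian.shape[0])
    for _ in range(steps):
        x = x - lr * (hessian @ x)
    return float(np.linalg.norm(x))

# Hypothetical Hessian: largest eigenvalue 1.0, trace 2.3.
H = np.diag([1.0, 0.8, 0.5])

for lr in (0.5, 1.5, 2.5):
    print(
        f"lr={lr}: eigenvalue condition={gd_is_linearly_stable(H, lr)}, "
        f"trace condition={satisfies_trace_bound(H, lr)}, "
        f"final GD norm={run_gd(H, lr):.3e}"
    )
```

For a positive semidefinite Hessian the trace is at least the largest eigenvalue, so the trace condition always implies the eigenvalue condition. In this example the learning rate 1.5 passes the eigenvalue condition but fails the trace condition, which is the kind of gap the abstract describes between GD and SGD.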
Taxonomy
Topics: Functional Brain Connectivity Studies · Advanced Fluorescence Microscopy Techniques · Sparse and Compressive Sensing Techniques
Methods: Stochastic Gradient Descent
