Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks
Pierfrancesco Beneventano, Blake Woodworth

TL;DR
This paper analyzes gradient descent in shallow linear networks, showing that it converges at a linear rate to minima with lower sharpness and lower norm than those reached by gradient flow, and highlights the benefits of training at the Edge of Stability.
Contribution
It provides an explicit characterization of GD convergence rates and solutions, revealing a trade-off between convergence speed and implicit regularization in shallow linear networks.
Findings
GD converges linearly to global minima with large stepsizes.
GD solutions have lower norm and sharpness than gradient flow solutions.
Training at the Edge of Stability induces beneficial implicit regularization.
Abstract
We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize of about 2/sharpness. It still converges for even larger stepsizes, but may do so very slowly. We also characterize the solution to which GD converges, which has lower norm and sharpness than the gradient flow solution. Our analysis reveals a trade-off between the speed of convergence and the magnitude of implicit regularization. This sheds light on the benefits of training at the "Edge of Stability", which induces additional regularization by delaying convergence and may have implications for training more complex models.
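The comparison described in the abstract can be illustrated with a small numerical sketch. It is not the paper's setup or code. It assumes the simplest instance of a depth-2 linear network: hidden width 1, scalar weights `a` and `b`, a target `y = 1`, and the loss `0.5 * (a*b - y)**2`. Gradient flow is approximated by GD with a very small stepsize. All names, initial values, and stepsizes below are illustrative choices. In this toy model the Hessian at a global minimum has the single nonzero eigenvalue `a**2 + b**2`, so sharpness and squared norm coincide there.

```python
import math


def loss(a, b, y=1.0):
    """Squared loss of the toy model f(x) = a * b * x on the single example (1, y)."""
    return 0.5 * (a * b - y) ** 2


def run_gd(a, b, stepsize, num_steps, y=1.0):
    """Run plain gradient descent on the toy loss and return the final weights."""
    for _ in range(num_steps):
        r = a * b - y  # residual
        # Simultaneous update: dL/da = r * b, dL/db = r * a.
        a, b = a - stepsize * r * b, b - stepsize * r * a
    return a, b


def sharpness_at_minimum(a, b):
    """Top Hessian eigenvalue at a global minimum (where a * b = y): a^2 + b^2."""
    return a * a + b * b


# Illustrative unbalanced initialization (an assumption, not taken from the paper).
a0, b0 = 3.0, 0.1

# Tiny stepsize: a stand-in for gradient flow, which conserves a^2 - b^2.
a_gf, b_gf = run_gd(a0, b0, stepsize=1e-3, num_steps=20_000)

# Large stepsize, a little below 2 / sharpness of the gradient-flow solution.
a_gd, b_gd = run_gd(a0, b0, stepsize=0.2, num_steps=2_000)

# With a^2 - b^2 conserved and a * b = 1 at the minimum,
# (a^2 + b^2)^2 = (a^2 - b^2)^2 + 4 predicts the gradient-flow sharpness.
gf_prediction = math.sqrt((a0**2 - b0**2) ** 2 + 4.0)

print("gradient-flow proxy: loss", loss(a_gf, b_gf),
      "sharpness", sharpness_at_minimum(a_gf, b_gf))
print("large-stepsize GD:   loss", loss(a_gd, b_gd),
      "sharpness", sharpness_at_minimum(a_gd, b_gd))
print("gradient-flow prediction for sharpness:", gf_prediction)
```

In this toy model each GD step multiplies the imbalance `a**2 - b**2` by `1 - stepsize**2 * r**2`. A larger stepsize, or a residual that takes longer to die out, therefore shrinks the imbalance more. That leaves a smaller `a**2 + b**2` at convergence, which is one way to read the stated trade-off between convergence speed and implicit regularization.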
