Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks

Pierfrancesco Beneventano; Blake Woodworth

arXiv:2501.09137·cs.LG·January 22, 2025

TL;DR

This paper shows that gradient descent on shallow linear networks converges at a linear rate to minima with lower sharpness and norm than those reached by gradient flow, highlighting a benefit of training at the Edge of Stability.

Contribution

It provides an explicit characterization of GD convergence rates and solutions, revealing a trade-off between convergence speed and implicit regularization in shallow linear networks.

Findings

01. GD converges linearly to a global minimum, even with large stepsizes (about 2/sharpness).

02. GD solutions have lower norm and sharpness than the gradient flow solution.

03. Training at the Edge of Stability induces additional implicit regularization by delaying convergence.

Abstract

We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize of about $2/\text{sharpness}$. It still converges for even larger stepsizes, but may do so very slowly. We also characterize the solution to which GD converges, which has lower norm and sharpness than the gradient flow solution. Our analysis reveals a trade-off between the speed of convergence and the magnitude of implicit regularization. This sheds light on the benefits of training at the "Edge of Stability", which induces additional regularization by delaying convergence and may have implications for training more complex models.
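
The comparison described in the abstract can be illustrated with a small numerical sketch. It is not the paper's code and rests on assumptions that are not taken from the source: the loss is taken to be $\frac{1}{2}(a^\top b - y)^2$ for a single scalar target $y$, gradient flow is approximated by GD with a very small stepsize, and the hidden width, initialization, stepsizes, and all function names are hypothetical choices made for illustration. For this assumed loss, the sharpness (largest Hessian eigenvalue) at a global minimum equals $\|a\|^2 + \|b\|^2$, so the squared norm reported below doubles as the sharpness of the solution found.

```python
# Minimal sketch (assumed setup, not the paper's experiments):
# depth-2 linear network f(x) = (a . b) * x with one input and one output,
# trained on the assumed loss L(a, b) = 0.5 * (a . b - y)^2.

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def loss(a, b, y):
    """Assumed squared loss on a single scalar target y."""
    r = dot(a, b) - y
    return 0.5 * r * r

def gd_step(a, b, y, eta):
    """One gradient descent step: dL/da = r * b and dL/db = r * a."""
    r = dot(a, b) - y
    new_a = [ai - eta * r * bi for ai, bi in zip(a, b)]
    new_b = [bi - eta * r * ai for ai, bi in zip(a, b)]
    return new_a, new_b

def run_gd(a, b, y, eta, steps):
    """Run a fixed number of GD steps and return the final weights."""
    for _ in range(steps):
        a, b = gd_step(a, b, y, eta)
    return a, b

def squared_norm(a, b):
    """||a||^2 + ||b||^2, which equals the sharpness at a global minimum
    of the assumed loss."""
    return dot(a, a) + dot(b, b)

# Hypothetical unbalanced initialization with hidden width 1.
a0, b0, y = [2.0], [0.1], 1.0

# Very small stepsize: used here as a stand-in for gradient flow.
a_small, b_small = run_gd(a0, b0, y, eta=0.01, steps=5000)

# Large stepsize: a sizeable fraction of 2 / sharpness for this problem.
a_large, b_large = run_gd(a0, b0, y, eta=0.4, steps=200)

print("small step: loss", loss(a_small, b_small, y),
      "squared norm", squared_norm(a_small, b_small))
print("large step: loss", loss(a_large, b_large, y),
      "squared norm", squared_norm(a_large, b_large))
```

Under these assumptions both runs reach a global minimum, and the large-stepsize run ends with a smaller squared norm. One way to see why: for this loss, each GD step multiplies the imbalance $a_i^2 - b_i^2$ by $1 - \eta^2 r^2$, where $r = a^\top b - y$ is the current residual, whereas gradient flow leaves the imbalance unchanged. A larger stepsize therefore balances the two layers more before convergence.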
