Sharpness-Aware Minimization and the Edge of Stability

Philip M. Long; Peter L. Bartlett

arXiv:2309.12488 · cs.LG · June 7, 2024

TL;DR

This paper extends the 'edge of stability' analysis of gradient descent (GD) to Sharpness-Aware Minimization (SAM). It derives a stability threshold for SAM that, unlike the GD threshold of $2/η$, depends on the norm of the gradient, and it shows empirically on three deep learning training tasks that SAM operates at this threshold.

Contribution

The paper derives a new 'edge of stability' condition for SAM, revealing its dependence on the gradient norm, and empirically confirms SAM's operation at this stability boundary.
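
The kind of calculation behind such a condition can be sketched as follows. This is a heuristic outline under assumptions that are not stated on this page: the loss is an exact quadratic $\ell(w) = \tfrac{1}{2} w^\top H w$, SAM takes a normalized ascent step of radius $\rho$ (a symbol introduced here for illustration), and the gradient norm $\|g\|$ is treated as fixed while one eigendirection of $H$ with eigenvalue $\lambda > 0$ is examined. The closed form at the end should be checked against the paper before it is relied on.

```latex
% GD on the quadratic: each eigencomponent is scaled by (1 - \eta\lambda).
w_{t+1} = w_t - \eta H w_t
\quad\Longrightarrow\quad
|1 - \eta\lambda| \le 1 \iff \lambda \le \frac{2}{\eta}.

% SAM (assumed normalized form), with g_t = \nabla\ell(w_t) = H w_t:
w_{t+1} = w_t - \eta \nabla\ell\!\left(w_t + \rho \frac{g_t}{\|g_t\|}\right)
        = w_t - \eta \left(H + \frac{\rho}{\|g_t\|} H^2\right) w_t .

% Each eigencomponent is now scaled by 1 - \eta\lambda(1 + \rho\lambda/\|g\|),
% so the stability condition becomes
\lambda + \frac{\rho \lambda^2}{\|g\|} \le \frac{2}{\eta}
\iff
\lambda \le \frac{\|g\|}{2\rho}\left(\sqrt{1 + \frac{8\rho}{\eta \|g\|}} - 1\right).
```

Under these assumptions the threshold tends to $2/\eta$ as $\|g\| \to \infty$ and shrinks toward zero as $\|g\| \to 0$, which is one way to read the statement that the SAM edge depends on the gradient norm.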

Findings

01

SAM's edge of stability depends on the gradient norm.

02

Empirical evidence shows SAM operates at the derived stability boundary.

03

SAM is a variant of GD that has been shown to improve generalization, which motivates characterizing its stability edge.

Abstract

Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $η$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/η$, after which it fluctuates around this value. The quantity $2/η$ has been called the "edge of stability" based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an "edge of stability" for Sharpness-Aware Minimization (SAM), a variant of GD which has been shown to improve its generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.
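
As a rough numerical illustration of the two thresholds described in the abstract, the sketch below works on a one-dimensional quadratic loss. Everything in it is an assumption made for illustration and does not come from this page or from the google-deepmind/sam_edge repository: the function names are hypothetical, `rho` denotes an assumed SAM perturbation radius, SAM is assumed to use a normalized ascent step, the gradient norm is held fixed, and `sam_edge` is the closed form obtained by solving `lam + rho * lam**2 / grad_norm = 2 / eta` for `lam`.

```python
import math


def gd_edge(eta):
    """GD threshold on a quadratic: curvature above 2/eta makes GD diverge."""
    return 2.0 / eta


def sam_edge(eta, rho, grad_norm):
    """Assumed SAM threshold: largest lam with lam + rho*lam**2/grad_norm <= 2/eta.

    Derived here from a quadratic model with a normalized ascent step;
    it is a sketch, not a formula quoted from the page above.
    """
    return grad_norm / (2.0 * rho) * (
        math.sqrt(1.0 + 8.0 * rho / (eta * grad_norm)) - 1.0
    )


def gd_diverges(lam, eta, steps=200, w0=1.0):
    """Run GD on loss(w) = lam * w**2 / 2 and report whether |w| grew."""
    w = w0
    for _ in range(steps):
        w -= eta * lam * w  # gradient of the quadratic is lam * w
    return abs(w) > abs(w0)


def sam_multiplier(lam, eta, rho, grad_norm):
    """One-step scaling factor of an eigencomponent under the assumed SAM update."""
    return 1.0 - eta * lam * (1.0 + rho * lam / grad_norm)


if __name__ == "__main__":
    eta, rho = 0.1, 0.1
    print("GD edge:", gd_edge(eta))
    for g in (0.01, 1.0, 100.0):
        print("grad norm", g, "-> SAM edge", sam_edge(eta, rho, g))
```

With `eta = 0.1` and `rho = 0.1`, the GD edge is 20, while the assumed SAM edge is about 10 at gradient norm 1 and moves toward 20 as the gradient norm grows. A one-step multiplier with magnitude above 1 corresponds to curvature beyond the assumed SAM edge.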

Code & Models

Repositories

google-deepmind/sam_edge (JAX, official)

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Neural Networks and Applications

Methods: Sharpness-Aware Minimization