Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training

Zhanpeng Zhou; Mingze Wang; Yuchen Mao; Bingrui Li; Junchi Yan

arXiv:2410.10373 · cs.LG · February 21, 2025

TL;DR

This paper shows that Sharpness-Aware Minimization (SAM) efficiently selects flatter minima late in training: a few epochs of SAM applied at the end of training yield nearly the same generalization and solution sharpness as full SAM training. The analysis offers insight into SAM's implicit bias and training dynamics.

Contribution

The paper identifies the late-stage effectiveness of SAM in selecting flatter minima, gives a theoretical account of the underlying dynamics, and extends these insights to adversarial training.

Findings

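01. SAM efficiently selects flatter minima late in training.

02. A few epochs of SAM applied at the end of training yield nearly the same generalization and solution sharpness as full SAM training.

03. Late-stage SAM dynamics have two phases: SAM first escapes the minimum found by SGD exponentially fast, then converges to a flatter minimum.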

Abstract

Sharpness-Aware Minimization (SAM) has substantially improved the generalization of neural networks under various settings. Despite the success, its effectiveness remains poorly understood. In this work, we discover an intriguing phenomenon in the training dynamics of SAM, shedding light on understanding its implicit bias towards flatter minima over Stochastic Gradient Descent (SGD). Specifically, we find that SAM efficiently selects flatter minima late in training. Remarkably, even a few epochs of SAM applied at the end of training yield nearly the same generalization and solution sharpness as full SAM training. Subsequently, we delve deeper into the underlying mechanism behind this phenomenon. Theoretically, we identify two phases in the learning dynamics after applying SAM late in training: i) SAM first escapes the minimum found by SGD exponentially fast; and ii) then rapidly…
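The late-phase schedule described in the abstract can be illustrated with a minimal sketch. It uses the standard two-step SAM update (perturb the weights along the normalized gradient by a radius rho, then descend using the gradient taken at the perturbed point) on a toy problem with a deterministic gradient. The function names, the learning rate, the value of rho, and the step counts are assumptions chosen for illustration; they are not taken from the paper, and this is not the authors' training code.

```python
import numpy as np


def sgd_step(w, grad_fn, lr):
    """Plain gradient step: w <- w - lr * grad(w)."""
    return w - lr * grad_fn(w)


def sam_step(w, grad_fn, lr, rho):
    """One standard SAM step (hypothetical helper, not the paper's code).

    1. Compute the gradient at w.
    2. Move to the perturbed point w + rho * g / ||g||.
    3. Update w with the gradient evaluated at the perturbed point.
    """
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # small constant avoids 0/0
    g_perturbed = grad_fn(w + eps)
    return w - lr * g_perturbed


def train_with_late_sam(w0, grad_fn, total_steps, sam_steps, lr=0.1, rho=0.05):
    """Run plain gradient steps first, then SAM for the final `sam_steps` steps.

    lr, rho and the step counts are illustrative assumptions.
    Returns the final weights and the per-step optimizer schedule.
    """
    w = np.asarray(w0, dtype=float)
    switch = max(0, total_steps - sam_steps)
    schedule = []
    for t in range(total_steps):
        if t < switch:
            w = sgd_step(w, grad_fn, lr)
            schedule.append("sgd")
        else:
            w = sam_step(w, grad_fn, lr, rho)
            schedule.append("sam")
    return w, schedule


if __name__ == "__main__":
    # Toy loss 0.5 * ||w||^2, whose gradient is simply w.
    quadratic_grad = lambda w: w
    w_final, schedule = train_with_late_sam(
        [1.0, -2.0], quadratic_grad, total_steps=10, sam_steps=3
    )
    print(schedule)
    print(w_final)
```

In a real training loop `grad_fn` would be a mini-batch gradient of the network loss, and each SAM step would cost two gradient evaluations. That doubling is why restricting SAM to the last few epochs is attractive if it retains most of the benefit.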

Taxonomy

Methods: Stochastic Gradient Descent · Sharpness-Aware Minimization