Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training
Zhanpeng Zhou, Mingze Wang, Yuchen Mao, Bingrui Li, Junchi Yan

TL;DR
Sharpness-Aware Minimization (SAM) selects flatter minima efficiently when it is applied late in training. Running SAM for only a few epochs at the end of training yields nearly the same generalization and solution sharpness as full SAM training, which offers insight into SAM's implicit bias and training dynamics.
Contribution
The paper shows that SAM is most effective at selecting flatter minima in the late stage of training, gives a theoretical account of the resulting dynamics, and extends these insights to adversarial training.
Findings
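SAM efficiently selects flatter minima late in training.
A few epochs of SAM applied at the end of training yield nearly the same generalization and solution sharpness as full SAM training.
Late-stage SAM dynamics have two phases: SAM first escapes the minimum found by SGD exponentially fast, then converges to a flatter minimum.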
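The late-stage switch described in these findings can be sketched on a toy problem. The code below is a minimal illustration, not the authors' implementation: it uses the standard SAM update (perturb the weights by rho times the normalized gradient, then descend along the gradient taken at the perturbed point) with full-batch gradients rather than stochastic ones. The names `sam_step`, `sgd_step`, and `train`, and the values of `lr`, `rho`, and `sam_steps`, are assumptions chosen for illustration.

```python
import numpy as np

def sgd_step(w, grad_fn, lr):
    # Plain gradient step (full-batch here; an assumption for simplicity).
    return w - lr * grad_fn(w)

def sam_step(w, grad_fn, lr, rho):
    # Standard SAM update: ascend to a nearby perturbed point,
    # then descend using the gradient evaluated at that point.
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # small constant avoids division by zero
    return w - lr * grad_fn(w + eps)

def train(w0, grad_fn, total_steps, sam_steps, lr=0.05, rho=0.05):
    # Hypothetical schedule: gradient descent first, SAM only for the last `sam_steps` steps.
    w = np.array(w0, dtype=float)
    schedule = []
    for t in range(total_steps):
        use_sam = t >= total_steps - sam_steps
        if use_sam:
            w = sam_step(w, grad_fn, lr, rho)
        else:
            w = sgd_step(w, grad_fn, lr)
        schedule.append("sam" if use_sam else "sgd")
    return w, schedule

# Toy quadratic loss 0.5 * w^T A w, whose gradient is A @ w (illustrative only).
A = np.diag([1.0, 10.0])
grad_fn = lambda w: A @ w
w_final, schedule = train([1.0, 1.0], grad_fn, total_steps=10, sam_steps=3)
```

A convex quadratic has a single minimum, so this sketch only shows the mechanics of the update and the schedule; it does not reproduce the escape-then-converge behavior or the generalization results reported in the paper.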
Abstract
Sharpness-Aware Minimization (SAM) has substantially improved the generalization of neural networks under various settings. Despite the success, its effectiveness remains poorly understood. In this work, we discover an intriguing phenomenon in the training dynamics of SAM, shedding light on understanding its implicit bias towards flatter minima over Stochastic Gradient Descent (SGD). Specifically, we find that SAM efficiently selects flatter minima late in training. Remarkably, even a few epochs of SAM applied at the end of training yield nearly the same generalization and solution sharpness as full SAM training. Subsequently, we delve deeper into the underlying mechanism behind this phenomenon. Theoretically, we identify two phases in the learning dynamics after applying SAM late in training: i) SAM first escapes the minimum found by SGD exponentially fast; and ii) then rapidly…
Taxonomy
Topics: Handwritten Text Recognition Techniques
Methods: Stochastic Gradient Descent · Sharpness-Aware Minimization
