An SDE for Modeling SAM: Theory and Insights
Enea Monzio Compagnoni, Luca Biggio, Antonio Orvieto, Frank Norbert Proske, Hans Kersting, Aurelien Lucchi

TL;DR
This paper develops continuous-time stochastic differential equation models for the SAM optimizer, providing theoretical insights into its preference for flat minima and behavior near saddle points, supported by experiments.
Contribution
It introduces rigorous SDE models for SAM and its variants, explaining their optimization dynamics and regularization effects.
Findings
SAM favors flat minima due to Hessian-dependent noise.
SAM is attracted to saddle points under certain conditions.
The SDE models accurately approximate discrete algorithms.
Abstract
We study the SAM (Sharpness-Aware Minimization) optimizer, which has recently attracted a lot of interest due to its increased performance over more classical variants of stochastic gradient descent. Our main contribution is the derivation of continuous-time models (in the form of SDEs) for SAM and two of its variants, both for the full-batch and mini-batch settings. We demonstrate that these SDEs are rigorous approximations of the real discrete-time algorithms (in a weak sense, scaling linearly with the learning rate). Using these models, we then offer an explanation of why SAM prefers flat minima over sharp ones: it minimizes an implicitly regularized loss with a Hessian-dependent noise structure. Finally, we prove that SAM is attracted to saddle points under some realistic conditions. Our theoretical results are supported by detailed experiments.
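For readers unfamiliar with the optimizer, the discrete-time update that the SDEs approximate can be sketched as follows. This is a minimal illustration of the standard full-batch SAM step as commonly stated in the literature, not the paper's SDE model or code; the function name `sam_step`, the toy quadratic, and the values of `lr` and `rho` are assumptions chosen for illustration.

```python
import numpy as np

def sam_step(x, grad_fn, lr=0.1, rho=0.05, eps=1e-12):
    """One full-batch SAM step: descend along the gradient taken at a perturbed point."""
    g = grad_fn(x)
    # Perturb along the normalized gradient, scaled by the radius rho.
    # eps guards against division by zero at a critical point.
    x_adv = x + rho * g / (np.linalg.norm(g) + eps)
    # Update the original iterate using the gradient at the perturbed point.
    return x - lr * grad_fn(x_adv)

# Toy quadratic f(x) = 0.5 * x^T H x (hypothetical example, not from the paper).
H = np.diag([10.0, 0.1])
grad_fn = lambda x: H @ x

x = np.array([1.0, 1.0])
for _ in range(200):
    x = sam_step(x, grad_fn)
```

Setting `rho=0` recovers plain gradient descent, so the difference between the two trajectories isolates the effect of the perturbation.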
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Advanced Bandit Algorithms Research
