Momentum-SAM: Sharpness Aware Minimization without Computational Overhead
Marlon Becker, Frederick Altrock, Benjamin Risse

TL;DR
Momentum-SAM is an optimization method that achieves sharpness-aware minimization without significant computational overhead over SGD or Adam, improving the training and generalization of deep neural networks.
Contribution
The paper proposes Momentum-SAM, a novel variant of SAM that reduces computational overhead by using momentum-based parameter perturbation, enabling efficient sharpness-aware optimization.
Findings
MSAM matches SAM's generalization benefits
MSAM requires training time similar to that of SGD/Adam
MSAM effectively reduces overfitting
Abstract
The recently proposed optimization algorithm for deep neural networks, Sharpness Aware Minimization (SAM), suggests perturbing parameters before gradient calculation by a gradient ascent step to guide the optimization into parameter space regions of flat loss. While significant generalization improvements and thus reduction of overfitting could be demonstrated, the computational costs are doubled due to the additionally needed gradient calculation, making SAM infeasible when computational capacity is limited. Motivated by Nesterov Accelerated Gradient (NAG), we propose Momentum-SAM (MSAM), which perturbs parameters in the direction of the accumulated momentum vector to achieve low sharpness without significant computational overhead or memory demands over SGD or Adam. We evaluate MSAM in detail and reveal insights on separable mechanisms of NAG, SAM and MSAM regarding training…
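The cost difference described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy quadratic loss, the hyperparameter values, and in particular the sign and normalization of the momentum perturbation are assumptions made here for illustration, since the abstract does not specify them. The SAM step follows the standard two-gradient recipe (ascent step, then gradient at the perturbed point); the momentum-based step reuses the momentum buffer as the perturbation direction and so needs only one gradient per update.

```python
import numpy as np

def sam_step(w, v, grad_fn, lr, mu, rho, eps=1e-12):
    """SAM with heavy-ball momentum: two gradient evaluations per step."""
    g = grad_fn(w)                                   # 1st gradient: ascent direction
    w_adv = w + rho * g / (np.linalg.norm(g) + eps)  # normalized ascent step
    g_adv = grad_fn(w_adv)                           # 2nd gradient: at perturbed point
    v = mu * v + g_adv                               # accumulate momentum
    return w - lr * v, v                             # update the unperturbed weights

def msam_like_step(w, v, grad_fn, lr, mu, rho, eps=1e-12):
    """Momentum-based perturbation: one gradient evaluation per step.

    ASSUMPTION: perturbing along +v / ||v|| is an illustrative choice,
    not taken from the abstract.
    """
    w_pert = w + rho * v / (np.linalg.norm(v) + eps)  # reuse momentum buffer
    g = grad_fn(w_pert)                               # the only gradient call
    v = mu * v + g
    return w - lr * v, v

def run(step_fn, steps=200):
    """Run an optimizer on a toy quadratic and count gradient calls."""
    A = np.diag([1.0, 2.0])  # hypothetical loss L(w) = 0.5 * w^T A w
    calls = 0

    def grad_fn(w):
        nonlocal calls
        calls += 1
        return A @ w

    w = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        w, v = step_fn(w, v, grad_fn, lr=0.1, mu=0.9, rho=0.01)
    return calls, 0.5 * w @ A @ w

sam_calls, sam_loss = run(sam_step)
msam_calls, msam_loss = run(msam_like_step)
print(sam_calls, msam_calls)
```

For 200 steps the sketch performs 400 gradient evaluations with the SAM step and 200 with the momentum-based step, which is the doubling the abstract refers to. The toy problem says nothing about sharpness or generalization; it only makes the per-step cost visible.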
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and ELM
Methods: Sharpness Aware Minimization · Nesterov Accelerated Gradient · Adam · Stochastic Gradient Descent
