Improving SAM Requires Rethinking its Optimization Formulation
Wanyun Xie, Fabian Latorre, Kimon Antonakopoulos, Thomas Pethick, Volkan Cevher

TL;DR
This paper proposes a new formulation of Sharpness-Aware Minimization (SAM) as a bilevel optimization problem called BiSAM, using a 0-1 loss relaxation to improve perturbation strength and enhance model performance.
Contribution
It introduces BiSAM, a novel SAM variant reformulated as a bilevel optimization problem with a 0-1 loss surrogate, leading to stronger perturbations and better results.
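The reformulation can be written out in symbols. The following is a sketch under assumed notation that this summary does not fix: $w$ denotes the network weights, $\epsilon$ the perturbation with radius $\rho$ (an $\ell_2$ ball is assumed here), $L$ the differentiable training loss, $L^{01}$ the 0-1 loss, and $L^{+}$, $L^{-}$ an upper-bound and a lower-bound surrogate of $L^{01}$.

```latex
% SAM: both players share one differentiable loss L (zero-sum game)
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} \; L(w + \epsilon)

% Target problem: the same game played on the 0-1 loss
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} \; L^{01}(w + \epsilon)

% Bilevel relaxation, assuming L^{-} \le L^{01} \le L^{+}:
% the minimizer uses the upper bound, the maximizer the lower bound
\min_{w} \; L^{+}\bigl(w + \epsilon^{\star}(w)\bigr)
\quad \text{s.t.} \quad
\epsilon^{\star}(w) \in \arg\max_{\|\epsilon\|_2 \le \rho} \; L^{-}(w + \epsilon)
```

Because the two players no longer optimize the same objective, the problem is no longer zero-sum, which is why it is naturally stated as a bilevel problem.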
Findings
BiSAM outperforms original SAM and variants in experiments.
BiSAM maintains similar computational complexity to SAM.
The code for BiSAM is publicly available.
Abstract
This paper rethinks Sharpness-Aware Minimization (SAM), which was originally formulated as a zero-sum game where the weights of a network and a bounded perturbation try to minimize and maximize, respectively, the same differentiable loss. To fundamentally improve this design, we argue that SAM should instead be reformulated using the 0-1 loss. As a continuous relaxation, we follow the simple conventional approach where the minimizing (maximizing) player uses an upper-bound (lower-bound) surrogate of the 0-1 loss. This leads to a novel formulation of SAM as a bilevel optimization problem, dubbed BiSAM. With a newly designed lower-bound surrogate loss, BiSAM indeed constructs a stronger perturbation. Through numerical evidence, we show that BiSAM consistently results in improved performance when compared to the original SAM and its variants, while enjoying similar computational complexity. Our code…
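The two-player structure described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a binary linear classifier, a base-2 logistic loss as the upper bound, a tanh-based lower bound (tanh is one of the lower-bound choices mentioned in the reviews), and a single normalized ascent step for the perturbation. The names `DATA`, `rho`, `lr`, and `alpha`, and all numeric values, are hypothetical.

```python
import math

# Hypothetical toy data: (features, label) pairs with labels in {-1, +1}.
DATA = [((1.0, 2.0), 1), ((2.0, 1.0), 1), ((-1.0, -1.5), -1), ((-2.0, -0.5), -1)]

def margin(w, x, y):
    """Signed margin y * <w, x>; a point counts as misclassified when it is <= 0."""
    return y * sum(wi * xi for wi, xi in zip(w, x))

def zero_one(w, data):
    """Average 0-1 loss."""
    return sum(margin(w, x, y) <= 0 for x, y in data) / len(data)

def upper(w, data):
    """Base-2 logistic loss, an upper bound on the 0-1 loss."""
    return sum(math.log2(1 + math.exp(-margin(w, x, y))) for x, y in data) / len(data)

def lower(w, data, alpha=1.0):
    """tanh(-alpha * margin), a lower bound on the 0-1 loss."""
    return sum(math.tanh(-alpha * margin(w, x, y)) for x, y in data) / len(data)

def grad_upper(w, data):
    """Gradient of the upper bound with respect to w."""
    g = [0.0] * len(w)
    for x, y in data:
        s = 1 / (1 + math.exp(margin(w, x, y)))  # sigmoid(-margin)
        for i, xi in enumerate(x):
            g[i] -= s * y * xi / (math.log(2) * len(data))
    return g

def grad_lower(w, data, alpha=1.0):
    """Gradient of the lower bound with respect to w."""
    g = [0.0] * len(w)
    for x, y in data:
        t = math.tanh(-alpha * margin(w, x, y))
        for i, xi in enumerate(x):
            g[i] -= alpha * (1 - t * t) * y * xi / len(data)
    return g

def perturbation(w, data, rho):
    """Maximizing player: one normalized ascent step on the lower bound."""
    g = grad_lower(w, data)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    return [rho * gi / norm for gi in g]

def bilevel_step(w, data, rho=0.05, lr=0.5):
    """Minimizing player: descend on the upper bound at the perturbed weights."""
    eps = perturbation(w, data, rho)
    w_adv = [wi + ei for wi, ei in zip(w, eps)]
    g = grad_upper(w_adv, data)
    return [wi - lr * gi for wi, gi in zip(w, g)]

w0 = [0.1, -0.2]
w = w0
for _ in range(50):
    w = bilevel_step(w, DATA)
```

As in SAM, each step in this sketch costs two gradient evaluations, one for the perturbation and one for the weight update; only the loss used for the perturbation differs.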
Peer Reviews
Decision: ICML 2024 Poster
- The approach is simple, scalable, and theoretically sound.
- The flow is easy to follow.
- The improvements are convincing and validated in many learning scenarios, including standard learning, fine-tuning, and noisy-data learning.
- As mentioned in the conclusion, it will be great to see if BiSAM benefits other domains, e.g. NLP
- The idea of directly aiming to solve the min-max of the 0-1 loss, and accordingly minimizing/maximizing different surrogates, brings novelty.
- The authors provide a theoretically justified lower bound for practical implementation. They also provide a clear discussion of two different choices of surrogates.
- The numerical results demonstrate that BiSAM improves accuracy.
- The numerical results show limited improvements. Also, in some other works (Foret et al., 2021; Liu et al., 2022), SAM achieves accuracy higher than the accuracy of BiSAM in this paper (with the same model and number of epochs). Liu, Y., Mai, S., Cheng, M., Chen, X., Hsieh, C. J., & You, Y. (2022). Random sharpness-aware minimization. Advances in Neural Information Processing Systems, 35, 24543-24556.
The BiSAM method proposed in the paper somewhat resolves the issue of optimizing the 0-1 loss using gradients. This method has been validated across multiple datasets, demonstrating its advantages over SAM through extensive experiments.
1. The idea of BiSAM is very good, but its performance in experiments is only marginally better than SAM. The improvement over SAM is often within the range of error, making it hard to believe that it is an enhancement of SAM.
2. Can you explain why BiSAM using tanh as the lower bound has higher test accuracy on CIFAR-10 compared to using -log as the lower bound, but the results are the opposite on CIFAR-100?
3. Could you combine the characteristics of tanh and -log to create a new lower boun
Taxonomy
Topics: Fault Detection and Control Systems · Software Reliability and Analysis Research
Methods: Sharpness-Aware Minimization
