Towards Efficient and Scalable Sharpness-Aware Minimization
Yong Liu, Siqi Mai, Xiangning Chen, Cho-Jui Hsieh, Yang You

TL;DR
This paper introduces LookSAM, a more efficient variant of Sharpness-Aware Minimization that reduces computational costs while maintaining accuracy, enabling large-batch training of vision transformers from scratch in minutes.
Contribution
We propose LookSAM, a novel algorithm that computes SAM's inner gradient ascent step only periodically rather than at every iteration, significantly reducing training overhead and enabling scalable large-batch training of vision transformers.
Findings
LookSAM achieves similar accuracy gains to SAM with much lower computational cost.
We successfully scale up batch size to 64k for training ViTs from scratch.
Training ViTs with LookSAM in minutes maintains competitive performance.
Abstract
Recently, Sharpness-Aware Minimization (SAM), which connects the geometry of the loss landscape and generalization, has demonstrated significant performance boosts when training large-scale models such as vision transformers. However, the update rule of SAM requires two sequential (non-parallelizable) gradient computations at each step, which can double the computational overhead. In this paper, we propose a novel algorithm, LookSAM, that only periodically calculates the inner gradient ascent to significantly reduce the additional training cost of SAM. The empirical results illustrate that LookSAM achieves similar accuracy gains to SAM while being tremendously faster: it enjoys computational complexity comparable to that of first-order optimizers such as SGD or Adam. To further evaluate the performance and scalability of LookSAM, we incorporate a layer-wise modification and perform…
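The cost argument in the abstract can be illustrated with a minimal sketch. It is not the authors' exact update rule: it assumes a toy quadratic loss, and it assumes that on the steps between inner-ascent computations the most recent difference between the SAM gradient and the plain gradient is simply reused. The names `loss_grad`, `sam_gradient`, and `train_periodic_sam`, and the parameters `k`, `lr`, and `rho`, are hypothetical and chosen for illustration only.

```python
import numpy as np


def loss_grad(w, A):
    """Gradient of the toy quadratic loss 0.5 * w^T A w (an assumed stand-in loss)."""
    return A @ w


def sam_gradient(w, A, rho):
    """SAM-style gradient: two sequential gradient evaluations."""
    g = loss_grad(w, A)
    # Inner ascent step: move a distance rho along the normalized gradient.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # The second gradient is taken at the perturbed point and depends on the first.
    return g, loss_grad(w + eps, A)


def train_periodic_sam(w0, A, steps=100, k=5, lr=0.05, rho=0.05):
    """Run the inner ascent only every k steps; reuse a cached correction otherwise.

    Assumption: the cached quantity is the difference between the SAM gradient
    and the plain gradient. This is a simplification, not the paper's rule.
    """
    w = np.array(w0, dtype=float)
    correction = np.zeros_like(w)
    grad_evals = 0
    for t in range(steps):
        if t % k == 0:
            g, g_sam = sam_gradient(w, A, rho)
            correction = g_sam - g  # cache for the next k - 1 steps
            grad_evals += 2
        else:
            g = loss_grad(w, A)
            g_sam = g + correction  # approximate SAM gradient, one evaluation
            grad_evals += 1
        w -= lr * g_sam
    return w, grad_evals


if __name__ == "__main__":
    A = np.diag([1.0, 10.0])
    for k in (1, 5):
        w, evals = train_periodic_sam([1.0, 1.0], A, k=k)
        print(f"k={k}: gradient evaluations={evals}, final loss={0.5 * w @ A @ w:.5f}")
```

With `steps=100`, setting `k=1` recovers the two-evaluations-per-step cost of SAM (200 gradient evaluations), while `k=5` needs 120, which approaches the 100 evaluations of a plain first-order optimizer as `k` grows.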
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Memory and Neural Computing
Methods: Stochastic Gradient Descent · Adam · Sharpness-Aware Minimization
