LightSAM: Parameter-Agnostic Sharpness-Aware Minimization

Yifei Cheng; Li Shen; Hao Sun; Nan Yin; Xiaochun Cao; Enhong Chen

arXiv:2505.24399·cs.LG·June 2, 2025

TL;DR

LightSAM is an adaptive variant of the SAM optimizer that automatically adjusts its hyperparameters, making it more robust and broadly applicable without the need for extensive tuning.

Contribution

We introduce LightSAM, which replaces fixed hyperparameters in SAM with adaptive strategies, enabling parameter-agnostic optimization with theoretical convergence guarantees.

Findings

01

LightSAM converges under weak assumptions regardless of hyperparameter choices.

02

Preliminary experiments show LightSAM's effectiveness across multiple deep learning tasks.

03

LightSAM reduces the need for hyperparameter tuning in sharpness-aware optimization.

Abstract

The Sharpness-Aware Minimization (SAM) optimizer improves the generalization of machine learning models by using weight perturbations to seek flat minima. Despite its empirical success, SAM introduces an additional hyper-parameter, the perturbation radius, to which it is sensitive. Moreover, it has been proved that SAM's perturbation radius and learning rate must be constrained by problem-dependent parameters to guarantee convergence. These limitations mean that parameter tuning is required in practical applications. In this paper, we propose LightSAM, an algorithm that sets the perturbation radius and learning rate of SAM adaptively, thus extending the application scope of SAM. LightSAM employs three popular adaptive optimizers, AdaGrad-Norm, AdaGrad and Adam, to replace the SGD optimizer for weight perturbation and model…
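
To make the idea in the abstract concrete, here is a minimal sketch of a SAM-style step in which both the perturbation and the model update use AdaGrad-Norm step sizes. It is an illustration under stated assumptions, not the authors' algorithm: the toy objective, the function names `stochastic_grad` and `adagrad_norm_sam`, the two separate accumulators `v` and `u`, and the base scales `rho` and `eta` are all hypothetical choices made for this example.

```python
import numpy as np

def stochastic_grad(w, rng, noise=0.1):
    """Noisy gradient of the toy objective f(w) = 0.5 * ||w||^2 (an assumption)."""
    return w + noise * rng.standard_normal(w.shape)

def adagrad_norm_sam(w0, rho=1.0, eta=1.0, steps=500, eps=1e-8, seed=0):
    """SAM-style loop with AdaGrad-Norm step sizes for both sub-steps.

    rho and eta are base scales only; the effective perturbation radius and
    learning rate shrink as squared gradient norms accumulate.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    v = eps  # accumulated squared gradient norms for the perturbation step
    u = eps  # accumulated squared gradient norms for the descent step
    for _ in range(steps):
        # Ascent (perturbation) step with an adaptive radius.
        g = stochastic_grad(w, rng)
        v += float(g @ g)
        w_adv = w + rho * g / np.sqrt(v)
        # Descent step using the gradient at the perturbed point.
        g_adv = stochastic_grad(w_adv, rng)
        u += float(g_adv @ g_adv)
        w = w - eta * g_adv / np.sqrt(u)
    return w

if __name__ == "__main__":
    # Try base scales spanning two orders of magnitude.
    for rho, eta in [(0.1, 0.1), (1.0, 1.0), (10.0, 10.0)]:
        w_final = adagrad_norm_sam([3.0, -2.0], rho=rho, eta=eta)
        print(rho, eta, np.linalg.norm(w_final))
```

On this toy problem the loop moves toward the minimizer for widely different base scales, because large early gradients inflate the accumulators and damp the effective step. This only illustrates the mechanism; it says nothing about the paper's guarantees or its AdaGrad and Adam variants.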

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- The paper develops a theoretical analysis of LightSAM’s convergence properties under adaptive learning rates and minimal assumptions, showing that it achieves a convergence rate of $O(\ln T/ \sqrt{T})$.
- The experiments on diverse datasets, including MNIST and ImageNet, demonstrate LightSAM’s robustness and parameter-agnostic nature.
- The paper includes comprehensive comparisons with SAM, AdaSAM, and other optimizers, showing that LightSAM achieves comparable accuracy while reducing tuning

Weaknesses

It is unclear how hyper-parameter tuning complexity should be evaluated in this setting. The paper shows that LightSAM's standard deviation is small across nine hyper-parameter settings, but it should explain the protocol behind that comparison. Moreover, it would be better to study hyper-parameter tuning more thoroughly (for example, showing that the results remain similar over a larger range of hyper-parameters) and to include more experimental settings (for other types of tasks).

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The convergence results being agnostic to parameter values, i.e., the perturbation radius, are interesting to the field.

Weaknesses

The benefits of the so-called LightSAM are not sufficiently demonstrated in the experiments. The paper does not explain why the convergence results are parameter-agnostic, nor does it discuss extensions to optimizers beyond the three presented.

Reviewer 03 · Rating 3 · Confidence 4

Strengths

- LightSAM effectively addresses a main limitation of SAM by making it parameter-agnostic, potentially simplifying hyperparameter tuning in practical applications.
- The empirical results are promising. LightSAM’s performance on MNIST and ImageNet, particularly its insensitivity to hyperparameters, demonstrates its practical potential and competitive accuracy.

Weaknesses

I have a major concern regarding the soundness of the proof. I think that the authors ignore the correlation between $w_t$ and $\xi_t$. Two problems then arise. First, due to this correlation, equalities such as the one in Lines 836-837 do not hold. Second, when applying the affine variance noise assumption at $w_t$, the authors use a form like $\mathbb{E}\|\nabla f(w_t,\xi_t)\|^2 \le D_0+D_1 \mathbb{E}\|\nabla f(w_t)\|^2$, which is not the form of Assumption 2, as there is no expectation on the RHS of Assumption 2. I th

Reviewer 04 · Rating 6 · Confidence 4

Strengths

- Algorithm that makes SAM parameter-agnostic by removing hyperparameter restrictions on convergence
- Provides theoretical analysis with $O(\log T/ \sqrt{T})$ convergence guarantees under weaker assumptions
- Demonstrates consistent improvements over AdaSAM across different datasets

Weaknesses

- My main issue is the mismatch between theory and empirics: the theoretical results focus on optimization convergence for differentiable loss landscapes, while the empirical results demonstrate generalization performance. This disconnect makes it challenging to assess the practical relevance of the theoretical contribution.
- Limited discussion of computational overhead, especially the extra overhead that comes from accumulating historical gradients. How long does it take in the experiments?

Taxonomy

Topics: Retinal Imaging and Analysis · Industrial Vision Systems and Defect Detection · Photoacoustic and Ultrasonic Imaging

Methods: Stochastic Gradient Descent · AdaGrad · Segment Anything Model · Adam