Revisiting Sharpness-Aware Minimization: A More Faithful and Effective Implementation

Jianlong Chen; Zhiming Zhou

arXiv:2603.10048 · cs.LG · March 12, 2026

Open Access · 3 Reviews

TL;DR

This paper revisits Sharpness-Aware Minimization (SAM), providing a new interpretation of its effectiveness, identifying limitations, and proposing an improved method called XSAM that explicitly estimates the maximum sharpness, leading to better generalization.

Contribution

The paper introduces XSAM, a novel implementation of SAM that explicitly estimates sharpness, improves approximation accuracy, and enhances training effectiveness with minimal computational cost.

Findings

01. XSAM outperforms existing SAM variants in experiments.

02. Explicit sharpness estimation improves approximation accuracy.

03. XSAM maintains low computational overhead.

Abstract

Sharpness-Aware Minimization (SAM) enhances generalization by minimizing the maximum training loss within a predefined neighborhood around the parameters. However, its practical implementation approximates this as gradient ascent(s) followed by applying the gradient at the ascent point to update the current parameters. This practice can be justified as approximately optimizing the objective by neglecting the (full) derivative of the ascent point with respect to the current parameters. Nevertheless, a direct and intuitive understanding of why using the gradient at the ascent point to update the current parameters works superiorly is still lacking. Our work bridges this gap by proposing a novel and intuitive interpretation. We show that the gradient at the single-step ascent point, when applied to the current parameters, provides a better approximation of the direction from the…
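The standard single-step SAM update that the abstract describes (one gradient ascent step, then applying the gradient at the ascent point to the current parameters) can be sketched as follows. This is a minimal illustration of baseline SAM only, not of XSAM, whose details are not given on this page. The toy quadratic loss, the function names `loss`, `grad`, `sam_step`, and `run`, and the values of `lr` and `rho` are all assumptions chosen for the example.

```python
import math


def loss(w):
    # Hypothetical toy loss (an assumption for illustration):
    # an ill-conditioned quadratic, L(w) = 0.5 * (10 * w0^2 + w1^2).
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)


def grad(w):
    # Analytic gradient of the toy loss above.
    return [10.0 * w[0], w[1]]


def sam_step(w, lr=0.05, rho=0.1):
    """One single-step SAM update on a list of floats.

    rho is the neighborhood radius and lr the learning rate;
    both values are illustrative assumptions.
    """
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # No ascent direction is defined at a stationary point.
        return list(w)
    # Ascent step: move to the boundary of the rho-ball along the gradient.
    w_ascent = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Gradient evaluated at the ascent point ...
    g_ascent = grad(w_ascent)
    # ... but applied to the current parameters, not to w_ascent.
    return [wi - lr * gi for wi, gi in zip(w, g_ascent)]


def run(w0, steps=100):
    # Repeatedly apply the SAM step starting from w0.
    w = list(w0)
    for _ in range(steps):
        w = sam_step(w)
    return w
```

Calling `run([1.0, 1.0])` iterates the update from a hypothetical starting point. The key line is the last one in `sam_step`: the descent direction comes from the perturbed point, while the update is made to the unperturbed parameters, which is the practice the paper sets out to interpret.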

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. XSAM provides a more accurate and adaptive estimation of the direction toward the local maximum, leading to better generalization.
2. The method is unified across single-step and multi-step settings and shows consistent improvements over SAM with minimal computational overhead.
3. The paper offers a clear theoretical and intuitive explanation of SAM’s limitations, enhancing understanding of sharpness-aware optimization.

Weaknesses

1. XSAM introduces additional hyperparameters (e.g., search range for α and update frequency), which may complicate hyperparameter tuning.
2. Although the overhead is small, XSAM still requires extra forward passes for direction estimation, slightly increasing computational cost.
3. Although the paper critiques multi-step SAM’s degradation, the comparison with existing multi-step variants (like MSAM or LSAM) could be more extensive to fully establish XSAM’s superiority in that regime.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The theoretical work on explaining the success of the SAM algorithm in minimizing the SAM objective, compared to the naive gradient, is of potential community interest.
- The paper probes between the SAM and naive gradient methods and shows that this strictly improves performance, which is of potential interest to the community. The experiments also support this.

Weaknesses

- Long sentences, hard to follow. The paper would benefit from better writing, focusing on short and clean sentences.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The paper is generally well-written and easy to follow.
- The authors perform visualization studies to show how single-step SAM gradient directions better approximate ascent directions within the neighborhood, while multi-step SAM may degrade. These visualizations ground the theoretical intuition in empirical phenomena.
- Despite the additional probing steps, the runtime overhead remains negligible. The method is compatible with SAM, making it practical and easy to integrate into real-world t

Weaknesses

- The underlying motivation for using $-v(\alpha^*)$ as the final gradient descent direction remains unclear. Following the direction of $-v(\alpha^*)$ appears to encourage moving away from a local neighborhood maximum. However, this does not necessarily guarantee convergence toward a flatter minimum. Additional clarification and theoretical justification would strengthen the argument.
- In the experimental section, the paper primarily compares the proposed method against standard SAM and SGD. G

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning