TL;DR
This paper investigates the implicit bias of Sharpness-Aware Minimization (SAM) in training linear diagonal networks, revealing depth-dependent behaviors and a phenomenon called sequential feature amplification.
Contribution
It provides a theoretical analysis of SAM's implicit bias in linear models, highlighting differences from gradient descent and introducing the concept of sequential feature amplification.
Findings
For linear models, SAM recovers the max-margin classifier.
For depth 2, SAM's limit depends on initialization and can differ from GD.
In finite time, $\ell_2$-SAM exhibits sequential feature amplification.
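For orientation, the two SAM variants named above can be written in the standard textbook form below. This is a sketch of the usual first-order SAM update, not a formula quoted from the paper: the symbols $\rho$ (perturbation radius), $\eta$ (step size), $w$ (all trainable parameters), and $\mathcal{L}$ (training loss) are notational assumptions, and the paper's exact definitions may differ.

```latex
\min_{w} \; \max_{\|\epsilon\|_p \le \rho} \mathcal{L}(w + \epsilon),
\qquad
w_{t+1} = w_t - \eta \, \nabla \mathcal{L}\bigl(w_t + \hat{\epsilon}_t\bigr),
\qquad
\hat{\epsilon}_t =
\begin{cases}
\rho \, \dfrac{\nabla \mathcal{L}(w_t)}{\|\nabla \mathcal{L}(w_t)\|_2}, & p = 2 \quad (\ell_2\text{-SAM}),\\[2ex]
\rho \, \operatorname{sign}\bigl(\nabla \mathcal{L}(w_t)\bigr), & p = \infty \quad (\ell_\infty\text{-SAM}).
\end{cases}
```

Here $\hat{\epsilon}_t$ is the maximizer of the linearized inner problem: the normalized gradient under the $\ell_2$ constraint and the sign of the gradient under the $\ell_\infty$ constraint.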
Abstract
We study the implicit bias of Sharpness-Aware Minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L = 1$), both $\ell_\infty$- and $\ell_2$-SAM recover the max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically, even on a single-example dataset. For $\ell_\infty$-SAM, the limit direction depends critically on initialization and can converge to or to any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant data coordinate. For $\ell_2$-SAM, we show that although its limit direction matches the max-margin solution as in the case of GD, its finite-time dynamics exhibit a phenomenon we call "sequential feature amplification", in which the predictor initially relies on…
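The setting in the abstract can be explored with a small simulation. The following is a minimal sketch, not the authors' code: it assumes the depth-2 diagonal network is parameterized as `beta = u * v` (elementwise), uses the logistic loss on a single example, and applies the standard first-order SAM step. The function names, the example data `x`, the initialization, and the values of `lr` and `rho` are all hypothetical choices made for illustration.

```python
import numpy as np


def loss_and_grads(u, v, x, y):
    """Logistic loss of the predictor beta = u * v on one example (x, y).

    Assumed parameterization (not taken from the paper): a depth-2
    diagonal linear network whose effective linear predictor is u * v.
    """
    beta = u * v
    margin = y * np.dot(beta, x)
    loss = np.logaddexp(0.0, -margin)            # log(1 + exp(-margin))
    # d loss / d margin = -sigmoid(-margin), written in a stable tanh form
    dmargin = -0.5 * (1.0 - np.tanh(margin / 2.0))
    g_beta = dmargin * y * x                     # gradient w.r.t. beta
    return loss, g_beta * v, g_beta * u          # chain rule through u * v


def sam_step(u, v, x, y, lr=0.1, rho=0.05, norm="l2"):
    """One first-order SAM step with an l2 or l-infinity perturbation."""
    _, gu, gv = loss_and_grads(u, v, x, y)
    g = np.concatenate([gu, gv])
    if norm == "l2":
        eps = rho * g / (np.linalg.norm(g) + 1e-12)   # normalized ascent
    elif norm == "linf":
        eps = rho * np.sign(g)                        # sign ascent
    else:
        raise ValueError(f"unknown norm: {norm}")
    d = u.size
    # Descent step uses the gradient evaluated at the perturbed point.
    _, gu_p, gv_p = loss_and_grads(u + eps[:d], v + eps[d:], x, y)
    return u - lr * gu_p, v - lr * gv_p


# Hypothetical single-example dataset with monotonically ordered entries.
x = np.array([3.0, 2.0, 1.0])
y = 1.0
for norm in ("l2", "linf"):
    u = np.full(3, 0.5)
    v = np.full(3, 0.5)
    for _ in range(500):
        u, v = sam_step(u, v, x, y, norm=norm)
    beta = u * v
    print(norm, beta / np.linalg.norm(beta))     # normalized direction
```

The script prints the normalized predictor direction after a fixed number of steps for each variant. It is meant only as a starting point for experimenting with initialization scale, `rho`, and training length; it makes no claim to reproduce the limit directions or the finite-time behavior analyzed in the paper.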
Peer Reviews
Decision·ICLR 2026 Poster
The paper is overall easy to read. In addition, the fact that the implicit bias of SAM changes with depth is interesting. A nice result is that the limiting direction of $\ell_\infty$-SAM depends on initialization, unlike GD / GF. (That said, for nonlinear architectures, GD / GF also depend on initialization.)
One weakness is the restrictive setting, as the paper deals with $L$-layer diagonal linear networks and assumes the data is linearly separable. Theorem 4.2 assumes directional convergence of the inner and outer layers as well as their flows, which is an extremely strong assumption and something that is highly nontrivial to prove in general. In addition, the analysis is restricted to a dataset consisting of one point (that has a monotonic structure with respect to its entries). Finally, the analysis
The rich behaviors that SAM can exhibit, as highlighted here, are interesting, and it is good to have more work studying logistic loss in this setting. The paper is well written, and the experiments on MNIST to some extent back up the theoretical insights.
Clearly, the model and data are very specialized. My main question is what the impact of a nonlinearity like ReLU would be. This is not covered by the experiments. No code was provided, making it more difficult to reproduce results or check implementation details of the experiments. For example, I do not think the paper details how exactly the data for Figure 7 is generated (although from the picture one might guess that it could be points drawn from a Gaussian distribution with me
1. Exceptionally clear and easy to follow. 2. The theoretical analysis is solid and convincingly demonstrates how network depth affects the implicit bias of SAM.
It is unclear whether these theoretical findings can inform practical algorithmic improvements—for example, proposing a better SAM-style method.
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis · Advanced Graph Neural Networks
