Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

Chaewon Moon; Dongkuk Si; Chulhee Yun

arXiv:2603.08290·cs.LG·May 19, 2026

TL;DR

This paper investigates the implicit bias of Sharpness-Aware Minimization (SAM) in training linear diagonal networks, revealing depth-dependent behaviors and a phenomenon called sequential feature amplification.

Contribution

It provides a theoretical analysis of SAM's implicit bias in linear models, highlighting differences from gradient descent and introducing the concept of sequential feature amplification.

Findings

01

For linear models ($L = 1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching GD.

02

For depth $L = 2$, the limit direction of $\ell_\infty$-SAM depends on initialization and can differ from that of GD.

03

In finite time, $\ell_2$-SAM exhibits sequential feature amplification.
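
For reference, the objects named in these findings are commonly written as below. These are the standard textbook formulations, not expressions copied from the paper, whose notation and normalization may differ. The symbols $\theta$ (parameters), $\mathcal{L}$ (training loss), $\rho$ (perturbation radius), $\eta$ (step size), and $(x_i, y_i)$ (labeled examples) are assumptions introduced here for illustration.

```latex
% SAM step: take the gradient at an adversarially perturbed point
\theta_{t+1} = \theta_t - \eta \, \nabla \mathcal{L}(\theta_t + \epsilon_t)

% \ell_2-SAM perturbation: normalized gradient ascent direction
\epsilon_t = \rho \, \frac{\nabla \mathcal{L}(\theta_t)}{\lVert \nabla \mathcal{L}(\theta_t) \rVert_2}

% \ell_\infty-SAM perturbation: coordinate-wise sign of the gradient
\epsilon_t = \rho \, \operatorname{sign}\!\bigl(\nabla \mathcal{L}(\theta_t)\bigr)

% \ell_p max-margin classifier on linearly separable data (p = 1 or p = 2)
\hat{\beta}_p \in \arg\min_{\beta} \lVert \beta \rVert_p
\quad \text{subject to} \quad y_i \langle x_i, \beta \rangle \ge 1 \ \text{for all } i
```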

Abstract

We study the implicit bias of Sharpness-Aware Minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L = 1$), both $\ell_{\infty}$- and $\ell_{2}$-SAM recover the $\ell_{2}$ max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically -- even on a single-example dataset. For $\ell_{\infty}$-SAM, the limit direction depends critically on initialization and can converge to $0$ or to any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant data coordinate. For $\ell_{2}$-SAM, we show that although its limit direction matches the $\ell_{1}$ max-margin solution as in the case of GD, its finite-time dynamics exhibit a phenomenon we call "sequential feature amplification", in which the predictor initially relies on…
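
The setting described in the abstract can be explored numerically with a minimal sketch like the one below. It is not the authors' code (the reviews note that none was released). It assumes a depth-2 diagonal parameterization $\beta = u \odot v$, mean logistic loss, a perturbation normalized jointly over $(u, v)$, and illustrative hyperparameters. The paper's exact parameterization and constants may differ, and all function names here (`logistic_loss`, `sam_step`, `run_sam`) are hypothetical.

```python
import numpy as np


def logistic_loss(beta, X, y):
    """Mean logistic loss log(1 + exp(-y * <x, beta>)), computed stably."""
    margins = y * (X @ beta)
    return np.logaddexp(0.0, -margins).mean()


def loss_grad_beta(beta, X, y):
    """Gradient of the mean logistic loss with respect to the predictor beta."""
    margins = y * (X @ beta)
    # sigmoid(-m) written with tanh to avoid overflow for large |m|
    sig_neg = 0.5 * (1.0 - np.tanh(0.5 * margins))
    return ((-y * sig_neg)[:, None] * X).mean(axis=0)


def param_grads(u, v, X, y):
    """Chain rule for the ASSUMED depth-2 diagonal network beta = u * v."""
    g = loss_grad_beta(u * v, X, y)
    return g * v, g * u


def sam_step(u, v, X, y, lr=0.1, rho=0.05, norm="l2"):
    """One SAM step: perturb along the ascent direction, then descend."""
    gu, gv = param_grads(u, v, X, y)
    if norm == "l2":
        # l2-SAM: normalized gradient (norm taken jointly over u and v)
        scale = np.sqrt(np.sum(gu**2) + np.sum(gv**2)) + 1e-12
        eu, ev = rho * gu / scale, rho * gv / scale
    elif norm == "linf":
        # l_inf-SAM: coordinate-wise sign of the gradient
        eu, ev = rho * np.sign(gu), rho * np.sign(gv)
    else:
        raise ValueError(f"unknown norm: {norm}")
    # Gradient evaluated at the perturbed point, applied at the current point
    gu_p, gv_p = param_grads(u + eu, v + ev, X, y)
    return u - lr * gu_p, v - lr * gv_p


def run_sam(u0, v0, X, y, steps=200, **kwargs):
    """Run SAM for a fixed number of steps and return the final (u, v)."""
    u, v = u0.copy(), v0.copy()
    for _ in range(steps):
        u, v = sam_step(u, v, X, y, **kwargs)
    return u, v


if __name__ == "__main__":
    # Single-example dataset with increasing coordinates (illustrative values)
    X = np.array([[1.0, 2.0, 3.0]])
    y = np.array([1.0])
    u0 = np.full(3, 0.5)
    v0 = np.full(3, 0.5)
    for norm in ("l2", "linf"):
        u, v = run_sam(u0, v0, X, y, steps=200, norm=norm)
        beta = u * v
        print(norm, "direction:", beta / np.linalg.norm(beta))
```

Logging the normalized predictor `beta / np.linalg.norm(beta)` at intermediate steps, rather than only at the end, is one way to look at finite-time behavior; whether this toy configuration reproduces the effects reported in the paper depends on the initialization scale, `rho`, and the number of steps.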

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 3

Strengths

The paper is overall easy to read. In addition, the fact that the implicit bias of SAM changes with depth is interesting. A nice result is that the limiting direction of $\ell_\infty$-SAM depends on initialization, unlike that of GD / GF. (That said, for nonlinear architectures, GD / GF also depends on initialization.)

Weaknesses

One weakness is the restrictive setting: the paper deals with $L$-layer diagonal linear networks and assumes the data is linearly separable. Theorem 4.2 assumes directional convergence of the inner and outer layers as well as their flows, which is an extremely strong assumption and something that is highly nontrivial to prove in general. In addition, the analysis is restricted to a dataset consisting of one point (that has a monotonic structure with respect to its entries). Finally, the analysis

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The rich behaviors that SAM can exhibit, as highlighted here, are interesting, and it is good to have more work studying logistic loss in this setting. The paper is well written and the experiments on MNIST to some extent back up the theoretical insights.

Weaknesses

Clearly, the model and data are very specialized. Maybe my main question would be what the impact of a nonlinearity like ReLU would be. This is not covered by the experiments. No code was provided, making it more difficult to reproduce results or check implementation details of the experiments. For example, I do not think the paper details how exactly the data for Figure 7 is generated (although from the picture one might guess that it could be points drawn from a Gaussian distribution with me

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Exceptionally clear and easy to follow. 2. The theoretical analysis is solid and convincingly demonstrates how network depth affects the implicit bias of SAM.

Weaknesses

It is unclear whether these theoretical findings can inform practical algorithmic improvements, for example by proposing a better SAM-style method.

Videos

Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization · slideslive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis · Advanced Graph Neural Networks