Conflicting Biases at the Edge of Stability: Norm versus Sharpness Regularization

Maria Matveev; Vit Fojtik; Hung-Hsu Chou; Gitta Kutyniok; Johannes Maly

arXiv:2505.21423 · cs.LG · December 19, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates how the interplay between norm minimization and sharpness regularization influences the generalization of neural networks trained with large learning rates, revealing a dynamic trade-off beyond single implicit biases.

Contribution

It demonstrates that implicit regularization involves a balance between norm and sharpness, and that neither bias alone explains generalization, highlighting the need for a broader perspective.

Findings

01. Learning rate influences the trade-off between norm and sharpness.

02. Neither implicit bias alone suffices to explain generalization.

03. Empirical and theoretical analysis of diagonal linear networks supports the trade-off concept.

Abstract

A widely believed explanation for the remarkable generalization capacities of overparameterized neural networks is that the optimization algorithms used for training induce an implicit bias towards benign solutions. To grasp this theoretically, recent works examine gradient descent and its variants in simplified training settings, often assuming vanishing learning rates. These studies reveal various forms of implicit regularization, such as $\ell_1$-norm minimizing parameters in regression and max-margin solutions in classification. Concurrently, empirical findings show that moderate to large learning rates exceeding standard stability thresholds lead to faster, albeit oscillatory, convergence in the so-called Edge-of-Stability regime, and induce an implicit bias towards minima of low sharpness (norm of training loss Hessian). In this work, we argue that a comprehensive understanding of…
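To make the notion of sharpness concrete, here is a minimal sketch, not the authors' code. It assumes a diagonal linear network $f_w(x) = w^{\odot 2} \cdot x$ trained with the squared loss $L(w) = \frac{1}{2n}\|X w^{\odot 2} - y\|_2^2$, and it reads "norm of training loss Hessian" as the spectral norm. The function names, the loss normalization, and the random data are all assumptions made for illustration. The point `w` is constructed to interpolate the data, so the Hessian there is positive semidefinite and its spectral norm equals its largest eigenvalue.

```python
import numpy as np

def grad(w, X, y):
    """Gradient of the assumed loss L(w) = (1/2n) * ||X (w*w) - y||^2."""
    n = X.shape[0]
    residual = X @ (w * w) - y
    return 2.0 * w * (X.T @ residual) / n

def hessian(w, X, y):
    """Closed-form Hessian of the same loss."""
    n = X.shape[0]
    residual = X @ (w * w) - y
    g = X.T @ residual / n          # vanishes at an interpolating point
    gram = X.T @ X / n
    return 2.0 * np.diag(g) + 4.0 * np.outer(w, w) * gram

def sharpness(w, X, y):
    """Spectral norm of the Hessian (largest absolute eigenvalue)."""
    return np.linalg.norm(hessian(w, X, y), 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))         # 5 samples, 8 parameters: overparameterized
w = rng.normal(size=8)
y = X @ (w * w)                     # labels chosen so that w interpolates the data

lam = sharpness(w, X, y)
eta_threshold = 2.0 / lam           # classical gradient-descent stability threshold
print(f"sharpness = {lam:.4f}, threshold 2/sharpness = {eta_threshold:.4f}")
```

Under these assumptions, a step size above `eta_threshold` makes gradient descent locally unstable around this particular minimum, which is the sense in which large learning rates favor minima of lower sharpness.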

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- The empirical results are verified in a variety of settings with many additional plots in the Appendix. This includes multiple architectures, datasets, and loss functions. In addition, the empirical setup is extensively detailed in the appendix, which provides strong evidence for the claimed tradeoff between $\ell_1$ norm and sharpness.
- The paper proposes multiple theoretical analyses for understanding the tradeoff between $\ell_1$ norm including an analysis of the diagonal linear net, and g

Weaknesses

- I believe that while the result is correctly stated in Appendix B, the paper mischaracterizes the results of Woodworth et al. throughout the rest of the paper. Their result is that under the model $f_w(x) = w^{\odot 2} \cdot x$, gradient flow with small initialization will converge to the parameter $w$ with minimal $\ell_2$ norm. If one defines the implicit classifier $\beta = w^{\odot 2}$, then this is equivalent to finding $\beta$ of minimal $\ell_1$ norm, since $\|\beta\|_1 = \|w\|_2^2$. **Woodworth et
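The identity $\|\beta\|_1 = \|w\|_2^2$ for $\beta = w^{\odot 2}$ quoted in this review follows because every entry of $\beta$ is a nonnegative square. A quick numerical check, with a randomly drawn `w` used purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=6)              # arbitrary parameter vector (illustrative)
beta = w * w                        # implicit linear predictor, beta = w ⊙ w

l1_beta = np.linalg.norm(beta, 1)   # sum of |beta_i| = sum of w_i^2
l2sq_w = np.linalg.norm(w, 2) ** 2  # squared Euclidean norm of w

print(l1_beta, l2sq_w)              # the two values agree up to rounding
```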

Reviewer 02 · Rating 0 · Confidence 5

Strengths

The idea of linking norm and sharpness regularization is cool, but it is not established in the paper.

Weaknesses

The theoretical results are strictly weaker than those of a paper that was published more than a year ago and is **not cited**: https://arxiv.org/pdf/2502.20531. In diagonal linear networks there is no progressive sharpening: when you pick a learning rate greater than $2/\lambda_{\max}$ of the Hessian of the solution at initialization, you converge to an oscillatory regime around the min-norm solution. This does not impact the generalization/error of the solution but only the norm of the way the solution is parameterized
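The threshold $2/\lambda_{\max}$ mentioned in this review is the classical stability limit of gradient descent on a quadratic. The sketch below illustrates only that textbook fact on a one-dimensional quadratic $L(x) = \frac{\lambda}{2}x^2$. It is not the diagonal-network setting discussed in the review, and the function name `gd_quadratic` and all numerical values are hypothetical choices for illustration.

```python
import numpy as np

def gd_quadratic(lam, eta, x0=1.0, steps=100):
    """Run gradient descent on L(x) = 0.5 * lam * x**2 and return all iterates."""
    xs = [x0]
    for _ in range(steps):
        # The gradient is lam * x, so each step multiplies x by (1 - eta * lam).
        xs.append(xs[-1] - eta * lam * xs[-1])
    return np.array(xs)

lam = 4.0                       # curvature (sharpness) of the quadratic
threshold = 2.0 / lam           # stability threshold, here 0.5

below = gd_quadratic(lam, eta=0.9 * threshold)   # factor -0.8: oscillates and decays
above = gd_quadratic(lam, eta=1.1 * threshold)   # factor -1.2: oscillates and grows

print(abs(below[-1]), abs(above[-1]))
```

In both runs the iterates flip sign at every step because the multiplier is negative. Only the run below the threshold contracts towards the minimum.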

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The experiments are relatively comprehensive, covering a range of different settings. The phenomena can be observed consistently throughout (although more cleanly in some experiments than others). The paper is also well presented and easy to read.

Weaknesses

The paper mostly describes an observed phenomenon and provides some general intuition behind it. However, it was not entirely clear to me what I can do with the result. Is the main message that the learning rate is an important parameter to influence the type of bias? In itself, that does not seem particularly surprising. It has been observed before that GD can behave quite differently from GF. Or it could be the insight that, for generalization, it can be useful to avoid the extreme of either b
