$\mu$P$^2$: Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling

Moritz Haas; Jin Xu; Volkan Cevher; Leena Chennuru Vankadara

arXiv:2411.00075·cs.LG·February 12, 2025

TL;DR

This paper analyzes the scaling behavior of Sharpness Aware Minimization (SAM) in the infinite-width limit of neural networks. It shows that standard SAM effectively perturbs only the last layer in wide networks, and it introduces a parameterization with layerwise perturbation scaling, $\mu$P$^2$, under which all layers are effectively perturbed.

Contribution

The paper introduces $\mu$P$^2$ (Maximal Update and Perturbation Parameterization), a parameterization with layerwise perturbation scaling that ensures all layers learn features and are effectively perturbed, improving hyperparameter transferability across model scales.

Findings

01

In wide networks, the dynamics of standard SAM effectively reduce to applying SAM only in the last layer, even with optimal hyperparameters.

02

$\mu$P$^2$ achieves better hyperparameter transfer across model scales.

03

The method extends to other perturbation rules like Adaptive SAM and SAM-ON.

Abstract

Sharpness Aware Minimization (SAM) enhances performance across various neural architectures and datasets. As models are continually scaled up to improve performance, a rigorous understanding of SAM's scaling behaviour is paramount. To this end, we study the infinite-width limit of neural networks trained with SAM, using the Tensor Programs framework. Our findings reveal that the dynamics of standard SAM effectively reduce to applying SAM solely in the last layer in wide neural networks, even with optimal hyperparameters. In contrast, we identify a stable parameterization with layerwise perturbation scaling, which we call Maximal Update and Perturbation Parameterization ($\mu$P$^2$), that ensures all layers are both feature learning and effectively perturbed in the limit. Through experiments with MLPs, ResNets and Vision Transformers, we empirically demonstrate that $\mu$P$^2$ …
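
As a rough illustration of what "layerwise perturbation scaling" can mean in a SAM update, the sketch below implements one SAM step in plain Python. Standard SAM perturbs the weights along the globally normalized gradient with a single radius `rho`; the sketch additionally multiplies each layer's perturbation by its own factor. The function names, the toy quadratic loss, and in particular the `layer_scales` values are hypothetical placeholders chosen for illustration. They are not the width-dependent scalings derived in the paper, which this page does not state.

```python
import math

def sam_perturbation(grads, rho, layer_scales=None):
    """Per-layer perturbation eps_l = rho * c_l * g_l / ||g||.

    grads: list of layers, each a list of floats.
    layer_scales: hypothetical per-layer factors c_l (assumption, not the
    paper's scalings). With all c_l = 1 this is the standard SAM perturbation.
    """
    if layer_scales is None:
        layer_scales = [1.0] * len(grads)
    # Global gradient norm taken over all layers jointly.
    global_norm = math.sqrt(sum(g * g for layer in grads for g in layer))
    global_norm = max(global_norm, 1e-12)  # guard against division by zero
    return [[rho * c * g / global_norm for g in layer]
            for c, layer in zip(layer_scales, grads)]

def sam_step(params, grad_fn, lr, rho, layer_scales=None):
    """One SAM update: perturb, take the gradient there, step from params."""
    grads = grad_fn(params)
    eps = sam_perturbation(grads, rho, layer_scales)
    perturbed = [[p + e for p, e in zip(pl, el)]
                 for pl, el in zip(params, eps)]
    sam_grads = grad_fn(perturbed)
    return [[p - lr * g for p, g in zip(pl, gl)]
            for pl, gl in zip(params, sam_grads)]

def quadratic_grad(params):
    """Gradient of the toy loss 0.5 * sum(p^2), which is just p itself."""
    return [[p for p in layer] for layer in params]

# Toy two-"layer" example.
params = [[3.0, 0.0], [0.0, 4.0]]
standard = sam_step(params, quadratic_grad, lr=0.1, rho=0.5)
# Placeholder factors that perturb only the first layer.
scaled = sam_step(params, quadratic_grad, lr=0.1, rho=0.5,
                  layer_scales=[2.0, 0.0])
```

Setting a layer's factor to zero removes its perturbation entirely, so that layer receives a plain gradient step. This is a crude stand-in for a layer that is not effectively perturbed, the situation the abstract describes for all but the last layer under standard SAM in wide networks.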
