μP²: Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling
Moritz Haas, Jin Xu, Volkan Cevher, Leena Chennuru Vankadara

TL;DR
This paper analyzes the scaling behavior of Sharpness Aware Minimization (SAM) in neural networks. It shows that in wide networks standard SAM effectively perturbs only the last layer, and it introduces μP², a parameterization with layerwise perturbation scaling in which every layer is effectively perturbed.
Contribution
The paper introduces μP², a parameterization with layerwise perturbation scaling that ensures all layers learn features and are effectively perturbed, and that improves hyperparameter transferability across model scales.
Findings
In wide networks, standard SAM effectively reduces to applying SAM in the last layer only, even with optimal hyperparameters.
μP² achieves better hyperparameter transfer across model scales.
The method extends to other perturbation rules like Adaptive SAM and SAM-ON.
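The difference between a single perturbation radius and layerwise radii can be sketched on a toy two-layer network. This is a minimal illustration under stated assumptions, not the paper's parameterization: the radii in `rhos` are hypothetical placeholder values (the width-dependent scaling derived in the paper is not reproduced here), the perturbation is normalized by the global gradient norm, and the helper names `loss_and_grads`, `perturbation`, and `sam_step` are made up for this example.

```python
import numpy as np

def loss_and_grads(params, x, y):
    """Mean squared error of a two-layer tanh MLP and its gradients."""
    w1, w2 = params
    h = np.tanh(x @ w1)                    # hidden features, (batch, width)
    err = h @ w2 - y                       # residuals, (batch, 1)
    loss = 0.5 * np.mean(err ** 2)
    d_out = err / x.shape[0]               # dLoss/dOutput
    g2 = h.T @ d_out                       # gradient w.r.t. w2
    d_h = (d_out @ w2.T) * (1.0 - h ** 2)  # backprop through tanh
    g1 = x.T @ d_h                         # gradient w.r.t. w1
    return loss, [g1, g2]

def perturbation(grads, rhos, eps=1e-12):
    """Ascent direction: layer l gets rho_l * g_l / ||g|| (global norm)."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads)) + eps
    return [rho * g / norm for rho, g in zip(rhos, grads)]

def sam_step(params, x, y, lr, rhos):
    """One SAM update; rhos holds one perturbation radius per layer."""
    _, grads = loss_and_grads(params, x, y)
    deltas = perturbation(grads, rhos)
    perturbed = [w + d for w, d in zip(params, deltas)]
    # The descent step uses the gradient taken at the perturbed weights.
    _, sam_grads = loss_and_grads(perturbed, x, y)
    return [w - lr * g for w, g in zip(params, sam_grads)]

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y = rng.normal(size=(8, 1))
params = [rng.normal(size=(4, 16)) / 2.0, rng.normal(size=(16, 1)) / 4.0]

# Standard SAM: one shared radius for every layer.
standard = sam_step(params, x, y, lr=0.1, rhos=[0.05, 0.05])
# Layerwise scaling: a separate (here arbitrary) radius per layer.
layerwise = sam_step(params, x, y, lr=0.1, rhos=[0.2, 0.05])
```

Passing the same value for every entry of `rhos` recovers standard SAM with a single radius, and setting all entries to zero recovers a plain gradient step. A layerwise rule only changes how much of the perturbation each layer receives.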
Abstract
Sharpness Aware Minimization (SAM) enhances performance across various neural architectures and datasets. As models are continually scaled up to improve performance, a rigorous understanding of SAM's scaling behaviour is paramount. To this end, we study the infinite-width limit of neural networks trained with SAM, using the Tensor Programs framework. Our findings reveal that the dynamics of standard SAM effectively reduce to applying SAM solely in the last layer in wide neural networks, even with optimal hyperparameters. In contrast, we identify a stable parameterization with layerwise perturbation scaling, which we call μP², that ensures all layers are both feature learning and effectively perturbed in the limit. Through experiments with MLPs, ResNets and Vision Transformers, we empirically demonstrate that μP²…
