Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization
Kaiyue Wen, Zhiyuan Li, Tengyu Ma

TL;DR
This paper critically examines the assumption that sharpness minimization leads to better generalization in neural networks, revealing complex relationships influenced by data and architecture, and suggesting the need for alternative explanations.
Contribution
It provides a theoretical and empirical analysis showing that sharpness minimization alone does not explain generalization in neural networks.
Findings
In some settings for two-layer ReLU networks, flatness provably implies generalization.
In other settings, non-generalizing flattest models exist and sharpness minimization algorithms fail to generalize.
In yet other settings, non-generalizing flattest models exist, but sharpness minimization algorithms still generalize.
Abstract
Despite extensive studies, the underlying reason as to why overparameterized neural networks can generalize remains elusive. Existing theory shows that common stochastic optimizers prefer flatter minimizers of the training loss, and thus a natural potential explanation is that flatness implies generalization. This work critically examines this explanation. Through theoretical and empirical investigation, we identify the following three scenarios for two-layer ReLU networks: (1) flatness provably implies generalization; (2) there exist non-generalizing flattest models and sharpness minimization algorithms fail to generalize, and (3) perhaps most surprisingly, there exist non-generalizing flattest models, but sharpness minimization algorithms still generalize. Our results suggest that the relationship between sharpness and generalization subtly depends on the data distributions and the…
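As a rough illustration of what "flatter minimizers" means, the sketch below compares two points that both reach zero training loss but differ in sharpness. Everything in it is an assumption made for illustration and is not taken from the paper: the toy loss (w1*w2 - 1)^2, the use of the Hessian trace as the sharpness measure, and the helper names loss and hessian_trace.

```python
import numpy as np

def loss(w):
    # Hypothetical toy loss: every point with w[0] * w[1] == 1 is a global minimizer.
    return (w[0] * w[1] - 1.0) ** 2

def hessian_trace(f, w, eps=1e-4):
    # Estimate the trace of the Hessian with central finite differences,
    # one coordinate direction at a time (assumed sharpness measure).
    w = np.asarray(w, dtype=float)
    base = f(w)
    trace = 0.0
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        trace += (f(w + step) - 2.0 * base + f(w - step)) / eps**2
    return trace

# Two minimizers with identical (zero) loss but different sharpness.
flat = hessian_trace(loss, [1.0, 1.0])
sharp = hessian_trace(loss, [2.0, 0.5])
print(f"Hessian trace at (1, 1):   {flat:.3f}")
print(f"Hessian trace at (2, 0.5): {sharp:.3f}")
```

For this toy loss the Hessian trace at a minimizer (a, 1/a) is 2(a^2 + 1/a^2), so the balanced point (1, 1) is the flattest one. The example only shows that minimizers with equal training loss can differ in sharpness; it says nothing about which of them generalizes, which is the question the paper examines.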
Taxonomy
Topics: Neural Networks and Applications · Stochastic Gradient Optimization Techniques · Machine Learning and ELM
