On the Maximum Hessian Eigenvalue and Generalization
Simran Kaur, Jeremy Cohen, Zachary C. Lipton

TL;DR
This paper critically examines the relationship between the maximum Hessian eigenvalue ($\lambda_{max}$) and neural network generalization, revealing that a smaller $\lambda_{max}$ does not always correlate with better generalization across various training interventions.
Contribution
The study provides empirical evidence challenging the assumption that $\lambda_{max}$ directly influences generalization, highlighting limitations of flatness-based metrics in explaining neural network performance.
Findings
Larger learning rates reduce $\lambda_{max}$ but do not always improve generalization at larger batch sizes.
Scaling batch size and learning rate can alter $\lambda_{max}$ without affecting generalization.
SAM reduces $\lambda_{max}$ but does not guarantee better generalization at larger batch sizes.
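For readers unfamiliar with SAM, the following is a minimal sketch of a single update as it is commonly described: take the gradient at the current weights, step a distance `rho` along the normalized gradient to an approximately worst-case nearby point, then descend using the gradient evaluated there. The function name `sam_step`, the callable `grad_fn`, and the default values of `lr` and `rho` are assumptions made for illustration; this is not the paper's experimental code.

```python
import numpy as np


def sam_step(grad_fn, w, lr=0.1, rho=0.05):
    """One SAM-style update (illustrative sketch, not the paper's code).

    grad_fn: hypothetical callable returning the loss gradient at w.
    lr, rho: illustrative step size and perturbation radius.
    """
    g = grad_fn(w)
    g_norm = np.linalg.norm(g)
    if g_norm == 0.0:
        # Zero gradient: the perturbation direction is undefined, so do nothing.
        return w
    # Ascent step of length rho along the normalized gradient.
    w_perturbed = w + rho * g / g_norm
    # Descent step from the original weights, using the perturbed-point gradient.
    return w - lr * grad_fn(w_perturbed)


# Toy usage on the quadratic loss 0.5 * w^T A w, whose gradient is A @ w.
A = np.diag([1.0, 10.0])
quadratic_loss = lambda w: 0.5 * w @ A @ w
quadratic_grad = lambda w: A @ w

w0 = np.array([1.0, 1.0])
w1 = sam_step(quadratic_grad, w0)
```

On this toy quadratic the step lowers the loss, but the example says nothing about generalization, which is the question the findings above address.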
Abstract
The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $\lambda_{max}$, the largest eigenvalue of the Hessian of the loss), and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $\lambda_{max}$ and generalization. In this paper, we present findings that call $\lambda_{max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $\lambda_{max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch…
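As a rough illustration of how the largest Hessian eigenvalue ($\lambda_{max}$) can be estimated without forming the Hessian, the sketch below runs power iteration on Hessian-vector products, which are approximated here by central finite differences of the gradient. The names `hessian_vector_product`, `max_hessian_eigenvalue`, and `grad_fn`, along with all default parameters, are assumptions for illustration and do not come from the paper. Note that power iteration converges to the eigenvalue of largest magnitude, which equals $\lambda_{max}$ only when no negative eigenvalue is larger in absolute value.

```python
import numpy as np


def hessian_vector_product(grad_fn, w, v, eps=1e-4):
    """Approximate H(w) @ v by a central finite difference of the gradient."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)


def max_hessian_eigenvalue(grad_fn, w, num_iters=200, seed=0):
    """Estimate the dominant Hessian eigenvalue at w by power iteration.

    grad_fn is a hypothetical callable returning the loss gradient at w.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(num_iters):
        hv = hessian_vector_product(grad_fn, w, v)
        eig = float(v @ hv)  # Rayleigh quotient for the current unit vector
        hv_norm = np.linalg.norm(hv)
        if hv_norm == 0.0:
            break
        v = hv / hv_norm
    return eig


# Toy check: for the loss 0.5 * w^T A w the Hessian is exactly A.
A = np.diag([1.0, 3.0, 10.0])
quadratic_grad = lambda w: A @ w
lam = max_hessian_eigenvalue(quadratic_grad, np.ones(3))
```

In practice the finite difference would typically be replaced by an exact Hessian-vector product from an automatic differentiation library, and the gradient would be computed over a large batch or the full training set.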
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Neural Networks and Applications
Methods: Sharpness-Aware Minimization · Dropout · Stochastic Gradient Descent
