On the Maximum Hessian Eigenvalue and Generalization

Simran Kaur; Jeremy Cohen; Zachary C. Lipton

arXiv:2206.10654·cs.LG·May 25, 2023·6 cites

Open Access

TL;DR

This paper critically examines the relationship between the maximum Hessian eigenvalue ($\lambda_{\max}$) and neural network generalization, showing that a smaller $\lambda_{\max}$ does not always correlate with better generalization across various training interventions.

Contribution

The study provides empirical evidence challenging the assumption that $\lambda_{\max}$ directly influences generalization, highlighting limitations of flatness-based metrics in explaining neural network performance.

Findings

01

Larger learning rates reduce $\lambda_{\max}$ but do not always improve generalization at larger batch sizes.

02

Scaling batch size and learning rate can alter $\lambda_{\max}$ without affecting generalization.

03

SAM reduces $\lambda_{\max}$ but does not guarantee better generalization at larger batch sizes.
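
For reference, the two quantities these findings refer to can be written in standard notation. This is a sketch rather than a quotation from the paper: the symbols $L$ (training loss), $w$ (network weights), $v$ (a unit direction), and $\rho$ (the SAM neighborhood radius) are assumed here for illustration.

```latex
% Largest Hessian eigenvalue (sharpness) at weights w
\lambda_{\max}(w) = \max_{\|v\|_2 = 1} \; v^\top \nabla^2 L(w) \, v

% Sharpness-Aware Minimization: minimize the worst-case loss in a
% rho-ball around the current weights
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} \; L(w + \epsilon)
```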

Abstract

The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $\lambda_{\max}$, the largest eigenvalue of the Hessian of the loss) and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $\lambda_{\max}$ and generalization. In this paper, we present findings that call $\lambda_{\max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $\lambda_{\max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch…
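
As an illustration of the flatness metric named in the abstract, the following sketch estimates $\lambda_{\max}$ by power iteration on Hessian-vector products. It is not the authors' code: the toy quadratic `loss`, the helper names (`grad`, `hvp`, `lambda_max`), and the finite-difference step sizes are all assumptions chosen to keep the example dependency-free. In practice one would likely use automatic differentiation on a real network loss.

```python
import math
import random

def loss(w):
    # Hypothetical toy loss: 0.5 * (3*w0^2 + w1^2), whose Hessian is diag(3, 1).
    return 0.5 * (3.0 * w[0] ** 2 + 1.0 * w[1] ** 2)

def grad(f, w, eps=1e-5):
    # Central-difference approximation of the gradient of f at w.
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((f(wp) - f(wm)) / (2 * eps))
    return g

def hvp(f, w, v, eps=1e-3):
    # Hessian-vector product: H v ~ (grad(w + eps*v) - grad(w - eps*v)) / (2*eps).
    wp = [wi + eps * vi for wi, vi in zip(w, v)]
    wm = [wi - eps * vi for wi, vi in zip(w, v)]
    gp, gm = grad(f, wp), grad(f, wm)
    return [(a - b) / (2 * eps) for a, b in zip(gp, gm)]

def lambda_max(f, w, iters=100, seed=0):
    # Power iteration. Note that it converges to the eigenvalue of largest
    # magnitude, which equals lambda_max only when that eigenvalue is positive.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in w]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(iters):
        hv = hvp(f, w, v)
        lam = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            break
        v = [x / norm for x in hv]
    return lam

estimate = lambda_max(loss, [0.5, -0.2])
```

For this toy loss the estimate should approach 3, the larger of the two Hessian eigenvalues. The matrix-free design matters because the full Hessian of a deep network is too large to form explicitly, whereas a Hessian-vector product costs roughly a few gradient evaluations.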

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Neural Networks and Applications

Methods: Sharpness-Aware Minimization · Dropout · Stochastic Gradient Descent