Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

Levent Sagun; Utku Evci; V. Ugur Guney; Yann Dauphin; Leon Bottou

arXiv:1706.04454·cs.LG·May 8, 2018·168 cites

TL;DR

This paper empirically analyzes the Hessian spectrum of over-parametrized neural networks, showing how the data and the model size shape the geometry of the loss landscape and what that geometry implies for optimization.

Contribution

It provides empirical and mathematical evidence on the Hessian spectrum structure and its relation to data, model size, and optimization landscapes in deep learning.

Findings

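01. The Hessian spectrum consists of a bulk centered near zero and outliers away from the bulk.

02. With the data fixed, increasing the number of parameters merely scales the bulk and leaves the outliers unaffected.

03. Training with different batch sizes converges to solutions that lie in connected regions of the loss landscape.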
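The first finding can be illustrated with a minimal sketch. It does not reproduce the paper's experiments: as a stand-in for a neural network it assumes a plain logistic regression model with more parameters than samples, and the names `logistic_hessian` and `hessian_spectrum`, the sizes, and the seed are all hypothetical. For this model the Hessian of the mean loss is X^T diag(p(1-p)) X / n, so its rank is at most the number of samples and the remaining eigenvalues sit at zero.

```python
import numpy as np


def logistic_hessian(X, w):
    """Hessian of the mean logistic loss at w: X^T diag(p(1-p)) X / n."""
    p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
    d = p * (1.0 - p)                  # per-sample curvature weights
    return (X.T * d) @ X / X.shape[0]


def hessian_spectrum(n_samples=20, n_params=50, seed=0):
    """Eigenvalues (ascending) for a toy over-parametrized problem.

    Assumption: Gaussian random inputs and a small random weight vector;
    these are illustrative choices, not the paper's setup.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_params))
    w = 0.1 * rng.normal(size=n_params)
    H = logistic_hessian(X, w)
    return np.linalg.eigvalsh(H)       # H is symmetric, so eigvalsh applies


eigs = hessian_spectrum()
near_zero = int(np.sum(np.abs(eigs) < 1e-10))
print("eigenvalues near zero:", near_zero, "of", eigs.size)
print("largest eigenvalue:", eigs[-1])
```

Because the rank of this Hessian cannot exceed `n_samples`, at least `n_params - n_samples` eigenvalues are numerically zero. That makes the Hessian singular, which is the kind of flatness the abstract refers to.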

Abstract

We study the properties of common loss surfaces through their Hessian matrix. In particular, in the context of deep learning, we empirically show that the spectrum of the Hessian is composed of two parts: (1) the bulk centered near zero, and (2) outliers away from the bulk. We present numerical evidence and mathematical justifications for the following conjectures laid out by Sagun et al. (2016): fixing the data, increasing the number of parameters merely scales the bulk of the spectrum; fixing the dimension and changing the data (for instance adding more clusters or making the data less separable) only affects the outliers. We believe that our observations have striking implications for non-convex optimization in high dimensions. First, the flatness of such landscapes (which can be measured by the singularity of the Hessian) implies that classical notions of basins of attraction may be…

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Random Matrices and Applications