Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Pratik Chaudhari; Anna Choromanska; Stefano Soatto; Yann LeCun; Carlo Baldassi; Christian Borgs; Jennifer Chayes; Levent Sagun; Riccardo Zecchina

arXiv:1611.01838 · cs.LG · April 24, 2017 · 114 cites

TL;DR

Entropy-SGD is a novel optimization algorithm that biases gradient descent towards wide, flat minima in the energy landscape, leading to better generalization in deep neural networks.

Contribution

The paper introduces Entropy-SGD, a new method that incorporates local entropy into the optimization process to find flatter minima, improving generalization.
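For orientation, the local entropy that the contribution refers to is commonly written as below. This is a sketch of the standard formulation, not text from this page: the symbols $f$ (the original loss), $x$ (the weights), $x'$ (a dummy integration variable), and $\gamma$ (a scoping parameter that sets how wide a neighborhood is considered) are introduced here for illustration.

```latex
% Local entropy: log-volume of low-loss configurations x' near x
F(x, \gamma) = \log \int \exp\!\Big( -f(x') - \frac{\gamma}{2}\,\lVert x - x' \rVert_2^2 \Big)\, dx'

% Its gradient is an expectation under the Gibbs distribution
% P(x'; x, \gamma) \propto \exp\big( -f(x') - \tfrac{\gamma}{2}\lVert x - x' \rVert_2^2 \big)
-\nabla_x F(x, \gamma) = \gamma \,\big( x - \langle x' \rangle \big)
```

Under this form, a wide valley contributes a large integral and therefore a high local entropy, while a sharp minimum of the same depth contributes little. Maximizing $F$ in place of minimizing $f$ directly is what biases the search towards flat regions.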

Findings

01. Entropy-SGD finds flatter minima that generalize better.

02. It compares favorably with standard SGD in generalization error and training time.

03. The method is effective on convolutional and recurrent neural networks.

Abstract

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage upon this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in the sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and show improved generalization over SGD using uniform…
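The two nested loops described in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `entropy_sgd_step`, the toy quadratic loss, and every hyperparameter value (`gamma`, `lr`, `inner_lr`, `inner_steps`, `noise`, `alpha`) are hypothetical choices made for the example. The inner loop runs noisy gradient steps (Langevin dynamics) on the loss plus a quadratic coupling to the current weights and keeps a running average of its iterates. The outer loop then moves the weights towards that average.

```python
import numpy as np


def entropy_sgd_step(x, grad_f, rng, gamma=1.0, lr=0.1, inner_lr=0.1,
                     inner_steps=20, noise=1e-3, alpha=0.75):
    """One outer update; all default values are illustrative assumptions."""
    x_prime = x.copy()  # inner-loop iterate, started at the current weights
    mu = x.copy()       # running average of the inner-loop iterates
    for _ in range(inner_steps):
        # Gradient of f(x') + (gamma / 2) * ||x - x'||^2 with respect to x'
        g = grad_f(x_prime) - gamma * (x - x_prime)
        # Langevin step: gradient descent plus scaled Gaussian noise
        eps = rng.standard_normal(x.shape)
        x_prime = x_prime - inner_lr * g + np.sqrt(inner_lr) * noise * eps
        # Exponential moving average of the inner iterates
        mu = (1.0 - alpha) * mu + alpha * x_prime
    # Outer step: gamma * (x - mu) estimates the negative local-entropy gradient
    return x - lr * gamma * (x - mu)


# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is simply x.
rng = np.random.default_rng(0)
x_final = np.array([2.0, -1.5])
for _ in range(200):
    x_final = entropy_sgd_step(x_final, lambda w: w, rng)
print("distance from the minimum:", np.linalg.norm(x_final))
```

In a real network, `grad_f` would be a minibatch gradient, so each outer update costs `inner_steps` gradient evaluations. That cost should be taken into account when comparing training time against plain SGD.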

Taxonomy

Topics: Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications

Methods: Stochastic Gradient Descent