Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, Riccardo Zecchina

TL;DR
Entropy-SGD is a novel optimization algorithm that biases gradient descent towards wide, flat minima in the energy landscape, leading to better generalization in deep neural networks.
Contribution
The paper introduces Entropy-SGD, a new method that incorporates local entropy into the optimization process to find flatter minima, improving generalization.
Findings
Entropy-SGD produces flatter minima with better generalization.
It compares favorably to standard SGD in generalization error and training time.
The method is effective on convolutional and recurrent neural networks.
Abstract
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage upon this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in the sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and show improved generalization over SGD using uniform…
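The nested-loop structure described in the abstract can be sketched on a toy problem. The block below is a minimal illustration, not the authors' implementation. It assumes the local-entropy objective takes the form F(x, gamma) = log of the integral of exp(-f(x') - (gamma/2) ||x - x'||^2) over x', so that its negative gradient is gamma * (x - <x'>), where <x'> is the mean of the inner Langevin samples. The function name `entropy_sgd`, all hyperparameter values, the exponential-average weight `alpha`, and the quadratic toy loss are hypothetical choices made for this example. Full gradients stand in for minibatch gradients to keep it short.

```python
import numpy as np

def entropy_sgd(grad_f, x0, outer_steps=200, inner_steps=20, lr=0.1,
                inner_lr=0.1, gamma=1.0, noise=1e-3, alpha=0.75, seed=0):
    """Toy sketch of a two-loop, local-entropy-style update (illustrative only)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(outer_steps):
        x_prime = x.copy()   # inner Langevin iterate, restarted at the current weights
        mu = x.copy()        # running average of the inner iterates, estimates <x'>
        for _ in range(inner_steps):
            # Gradient of f(x') + (gamma/2) * ||x - x'||^2 with respect to x'.
            g = grad_f(x_prime) - gamma * (x - x_prime)
            # Langevin step: gradient descent plus scaled Gaussian noise.
            x_prime = (x_prime - inner_lr * g
                       + np.sqrt(inner_lr) * noise * rng.standard_normal(x.shape))
            # Exponential moving average of the samples.
            mu = (1.0 - alpha) * mu + alpha * x_prime
        # Outer step along the estimated negative local-entropy gradient gamma * (x - mu).
        x = x - lr * gamma * (x - mu)
    return x

# Hypothetical toy loss: an axis-aligned quadratic bowl with one sharp direction.
curvatures = np.array([1.0, 10.0])

def loss(x):
    return 0.5 * float(np.sum(curvatures * x ** 2))

def grad(x):
    return curvatures * x

x_start = np.array([2.0, -1.5])
x_final = entropy_sgd(grad, x_start)
print("initial loss:", loss(x_start))
print("final loss:", loss(x_final))
```

On a convex bowl like this the sketch only shows the mechanics of the two loops. The wide-versus-sharp preference the abstract describes would need a non-convex loss with several minima to become visible.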
Taxonomy
Topics: Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications
Methods: Stochastic Gradient Descent
