Sharp Minima Can Generalize For Deep Nets
Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio

TL;DR
This paper challenges the hypothesis that flat minima explain generalization in deep nets. It shows that, because of symmetries and reparameterizations of parameter space, a model that generalizes well can equally be placed at an arbitrarily sharp minimum.
Contribution
It demonstrates that flatness is not a necessary condition for generalization in deep networks, highlighting the role of symmetries and reparameterizations.
Findings
Most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization.
Equivalent models can correspond to arbitrarily sharp minima.
Reparameterizations can change geometry without affecting generalization.
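The findings above can be checked numerically on a toy model. The following is a minimal sketch, not the authors' code: it assumes a two-parameter rectifier network `w2 * relu(w1 * x)`, a squared-error loss, and the largest Hessian eigenvalue as the sharpness proxy (one of several possible flatness measures). The names `predict`, `loss`, `numerical_hessian`, and the value `alpha = 10` are all illustrative choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def predict(theta, x):
    # Toy one-hidden-unit rectifier network (an assumption for illustration).
    w1, w2 = theta
    return w2 * relu(w1 * x)

def loss(theta, x, y):
    # Mean squared error on the toy data set.
    return np.mean((predict(theta, x) - y) ** 2)

def numerical_hessian(f, theta, eps=1e-3):
    # Central finite-difference estimate of the Hessian of f at theta.
    theta = np.asarray(theta, dtype=float)
    n = theta.size
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_i[i] = eps
            e_j = np.zeros(n)
            e_j[j] = eps
            hess[i, j] = (
                f(theta + e_i + e_j) - f(theta + e_i - e_j)
                - f(theta - e_i + e_j) + f(theta - e_i - e_j)
            ) / (4 * eps ** 2)
    return hess

# Data generated by the network itself, so theta is a zero-loss minimum.
x = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])
y = relu(x)
theta = np.array([1.0, 1.0])

# Rescale: multiply the first layer by alpha, divide the second by alpha.
alpha = 10.0
theta_scaled = np.array([alpha * theta[0], theta[1] / alpha])

# Both parameter settings compute the same function on the data.
same_function = np.allclose(predict(theta, x), predict(theta_scaled, x))

# Sharpness proxy: largest eigenvalue of the loss Hessian at each point.
objective = lambda t: loss(t, x, y)
sharpness = np.linalg.eigvalsh(numerical_hessian(objective, theta)).max()
sharpness_scaled = np.linalg.eigvalsh(numerical_hessian(objective, theta_scaled)).max()

print("same function:", same_function)
print("sharpness at theta:", sharpness)
print("sharpness at rescaled theta:", sharpness_scaled)
```

Both parameter vectors give identical predictions, and therefore identical generalization, yet the rescaled point has a much larger top Hessian eigenvalue. Increasing `alpha` makes the gap grow further, which is the sense in which the equivalent minimum can be made arbitrarily sharp.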
Abstract
Despite their overwhelming capacity to overfit, deep learning architectures tend to generalize relatively well to unseen data, allowing them to be deployed in practice. However, explaining why this is the case is still an open area of research. One standing hypothesis that is gaining popularity, e.g. Hochreiter & Schmidhuber (1997); Keskar et al. (2017), is that the flatness of minima of the loss function found by stochastic gradient based methods results in good generalization. This paper argues that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization. Specifically, when focusing on deep networks with rectifier units, we can exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima.…
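The symmetry the abstract refers to can be written out for a one-hidden-layer rectifier network. The derivation below is a sketch under stated assumptions: the symbols $\theta_1$, $\theta_2$ (layer weights), $\phi_{\mathrm{rect}}$ (the rectifier), $D_\alpha$ (the block-diagonal rescaling), and $L$ (the training loss, assumed twice differentiable at the point considered) are notation chosen here for illustration.

```latex
\begin{align*}
% Non-negative homogeneity of the rectifier
\phi_{\mathrm{rect}}(\alpha z) &= \alpha\,\phi_{\mathrm{rect}}(z)
  \quad \text{for all } \alpha > 0, \\
% Rescaling the two layers in opposite directions leaves the function unchanged
f_{(\alpha\theta_1,\;\alpha^{-1}\theta_2)}(x)
  &= \phi_{\mathrm{rect}}(x \cdot \alpha\theta_1)\cdot \alpha^{-1}\theta_2
   = \phi_{\mathrm{rect}}(x \cdot \theta_1)\cdot \theta_2
   = f_{(\theta_1,\,\theta_2)}(x), \\
% Hence the loss is invariant, with D_\alpha = \mathrm{diag}(\alpha I,\ \alpha^{-1} I)
L(D_\alpha \theta) &= L(\theta), \\
% Differentiating twice with the chain rule
\nabla^2 L(D_\alpha \theta) &= D_\alpha^{-1}\,\nabla^2 L(\theta)\,D_\alpha^{-1}.
\end{align*}
```

The last line shows that the Hessian block for $\theta_2$ is multiplied by $\alpha^{2}$ while the block for $\theta_1$ is multiplied by $\alpha^{-2}$. Unless the $\theta_2$ block is zero, choosing $\alpha$ large makes the curvature along $\theta_2$ arbitrarily large without changing the function the network computes.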
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Advanced Neural Network Applications
