The loss surface of deep and wide neural networks
Quynh Nguyen, Matthias Hein

TL;DR
This paper shows that for wide, pyramidal, fully connected neural networks with squared loss and analytic activation functions, almost all local minima are globally optimal, which helps explain why training often succeeds despite non-convexity.
Contribution
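The paper proves that, for fully connected networks with squared loss and analytic activation functions, almost all local minima are globally optimal, provided that one hidden layer has more units than there are training points and the network is pyramidal from that layer onward.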
Findings
Almost all local minima are globally optimal in the specified network class.
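The result requires one hidden layer with more units than there are training points, with a pyramidal structure from that layer onward.
This benign landscape offers a possible explanation for why training such networks often succeeds in practice.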
Abstract
While the optimization problem behind deep neural networks is highly non-convex, it is frequently observed in practice that training deep networks seems possible without getting stuck in suboptimal points. It has been argued that this is the case as all local minima are close to being globally optimal. We show that this is (almost) true, in fact almost all local minima are globally optimal, for a fully connected network with squared loss and analytic activation function given that the number of hidden units of one layer of the network is larger than the number of training points and the network structure from this layer on is pyramidal.
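The structural condition in the abstract can be checked mechanically for a given architecture. The sketch below is only an illustration, not code from the paper: the function name `find_wide_pyramidal_layer` and its arguments are hypothetical, and "pyramidal" is assumed here to mean that layer widths are non-increasing from the wide layer through to the output. It does not check the other hypotheses (squared loss, analytic activation).

```python
from typing import Optional, Sequence


def find_wide_pyramidal_layer(
    hidden_widths: Sequence[int], output_dim: int, n_train: int
) -> Optional[int]:
    """Return the 0-based index of the first hidden layer that is wider than
    the number of training points and from which the widths are non-increasing
    up to the output layer. Return None if no hidden layer qualifies.

    Assumption: "pyramidal" is read as non-increasing widths.
    """
    widths = list(hidden_widths) + [output_dim]
    for k in range(len(hidden_widths)):
        if widths[k] <= n_train:
            continue  # layer k is not wider than the training set
        tail = widths[k:]
        # Check that widths never grow from layer k to the output layer.
        if all(a >= b for a, b in zip(tail, tail[1:])):
            return k
    return None


if __name__ == "__main__":
    # Hypothetical architecture: 1000 training points, 10 outputs.
    print(find_wide_pyramidal_layer([512, 2000, 500, 100], 10, 1000))
```

In this hypothetical example the second hidden layer (index 1) has 2000 units, more than the 1000 training points, and the widths 2000, 500, 100, 10 shrink from there on, so the condition as interpreted here is met.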
Taxonomy
Topics: Neural Networks and Applications
