Are ResNets Provably Better than Linear Predictors?
Ohad Shamir

TL;DR
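This paper proves that the optimization landscape of arbitrarily deep, nonlinear residual units contains no local minima with value above what a linear predictor can obtain, so training cannot get stuck at a local minimum worse than the best linear predictor. The result holds under minimal assumptions.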
Contribution
It gives a rigorous analysis of the ResNet optimization landscape, proving under minimal assumptions that every local minimum is at least as good as the best linear predictor.
Findings
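The ResNet optimization landscape has no local minima with value above what a linear predictor can obtain.
The guarantee covers arbitrarily deep, nonlinear residual units; it shows they are no worse than linear predictors, not that they are strictly better.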
With a certain tweak to the architecture, standard SGD attains an objective value close to or better than that of any linear predictor.
Abstract
A residual network (or ResNet) is a standard deep neural net architecture, with state-of-the-art performance across numerous applications. The main premise of ResNets is that they allow the training of each layer to focus on fitting just the residual of the previous layer's output and the target output. Thus, we should expect that the trained network is no worse than what we can obtain if we remove the residual layers and train a shallower network instead. However, due to the non-convexity of the optimization problem, it is not at all clear that ResNets indeed achieve this behavior, rather than getting stuck at some arbitrarily poor local minimum. In this paper, we rigorously prove that arbitrarily deep, nonlinear residual units indeed exhibit this behavior, in the sense that the optimization landscape contains no local minima with value above what can be obtained with a linear…
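The "no worse than a shallower network" premise can be illustrated with a minimal sketch. It assumes a single residual unit followed by a linear output, written here as w^T (x + V f_theta(x)); this parameterization, the tanh branch, the squared loss, and the synthetic data are all assumptions made for illustration, not details taken from the text above. Setting V to zero collapses the model to the linear predictor w^T x, so the best ResNet objective value can never exceed the best linear one. The sketch shows only this containment argument; the paper's claim about local minima is stronger and is not demonstrated by the code.

```python
import math
import random

def linear_predict(w, x):
    """Linear predictor: w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def residual_branch(theta, x):
    """Hypothetical nonlinear branch f_theta(x): a single tanh layer."""
    return [math.tanh(sum(t * xi for t, xi in zip(row, x))) for row in theta]

def resnet_predict(w, V, theta, x):
    """Assumed residual form: w^T (x + V f_theta(x))."""
    h = residual_branch(theta, x)
    skip_plus_residual = [
        xi + sum(vij * hj for vij, hj in zip(v_row, h))
        for xi, v_row in zip(x, V)
    ]
    return linear_predict(w, skip_plus_residual)

def mean_squared_loss(predict, data):
    """Average squared error of a predictor over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

# Synthetic data (an assumption for illustration only).
random.seed(0)
d, k, n = 3, 4, 50  # input dim, branch width, sample count
data = []
for _ in range(n):
    x = [random.gauss(0, 1) for _ in range(d)]
    y = math.sin(x[0]) + 0.5 * x[1]
    data.append((x, y))

w = [random.gauss(0, 1) for _ in range(d)]
theta = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
V_zero = [[0.0] * k for _ in range(d)]  # residual branch switched off

linear_loss = mean_squared_loss(lambda x: linear_predict(w, x), data)
resnet_loss = mean_squared_loss(
    lambda x: resnet_predict(w, V_zero, theta, x), data
)

# With V = 0 the two predictors coincide for any w and any theta.
print(math.isclose(linear_loss, resnet_loss))
```

Because the equality holds for every choice of w, minimizing over (w, V, theta) can only match or improve on minimizing over w alone. What is not obvious, and what the paper addresses, is whether a local search can stall at a point worse than that linear baseline.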
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Generative Adversarial Networks and Image Synthesis
