Towards Understanding the Importance of Shortcut Connections in Residual Networks

Tianyi Liu; Minshuo Chen; Mo Zhou; Simon S. Du; Enlu Zhou; Tuo Zhao

arXiv:1909.04653 · cs.LG · November 5, 2019 · 22 cites

TL;DR

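This paper investigates why residual networks train efficiently. For a two-layer non-overlapping convolutional ResNet, it shows that gradient descent combined with proper normalization avoids the spurious local optimum and converges to a global optimum in polynomial time, provided the first-layer weight is initialized at 0 and the second-layer weight is initialized anywhere in a ball.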

Contribution

It provides a theoretical convergence guarantee for training a two-layer non-overlapping convolutional ResNet, a non-convex problem with a spurious local optimum, using gradient descent with proper normalization.

Findings

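01 Gradient descent combined with proper normalization avoids being trapped by the spurious local optimum.

02 With the first-layer weight initialized at 0 and the second-layer weight initialized arbitrarily in a ball, training converges to a global optimum in polynomial time.

03 Numerical experiments support the theoretical results.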

Abstract

The Residual Network (ResNet) is undoubtedly a milestone in deep learning. ResNet is equipped with shortcut connections between layers and exhibits efficient training using simple first-order algorithms. Despite its great empirical success, the reason behind it is far from well understood. In this paper, we study a two-layer non-overlapping convolutional ResNet. Training such a network requires solving a non-convex optimization problem with a spurious local optimum. We show, however, that gradient descent combined with proper normalization avoids being trapped by the spurious local optimum and converges to a global optimum in polynomial time when the weight of the first layer is initialized at 0 and that of the second layer is initialized arbitrarily in a ball. Numerical experiments are provided to support our theory.
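To make the setting in the abstract concrete, the following is a minimal NumPy sketch, not the authors' code. Several details are assumptions made here for illustration, because the abstract does not state them: the ReLU activation, the squared loss, the fixed unit-norm shortcut vector `v`, the teacher network that generates the labels, the step size, the ball radius, and the choice to normalize by rescaling `v + w` to unit norm after every step. All function and variable names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def predict(Z, w, a, v):
    """Two-layer non-overlapping convolutional ResNet (illustrative form).

    Z has shape (n, p, k): n inputs, each split into k non-overlapping
    patches of size p. The first layer applies one shared filter (v + w)
    to every patch, where v is a fixed shortcut direction (assumption).
    """
    pre = np.einsum("npk,p->nk", Z, v + w)  # filter response per patch
    return relu(pre) @ a                    # second layer mixes the k patches

def normalized_gd_step(Z, y, w, a, v, lr):
    """One gradient step on the mean squared loss, then normalization."""
    pre = np.einsum("npk,p->nk", Z, v + w)
    hidden = relu(pre)
    err = hidden @ a - y                    # residuals, shape (n,)
    n = len(y)
    grad_a = hidden.T @ err / n
    # Chain rule through ReLU: only patches with positive response contribute.
    grad_w = np.einsum("n,npk,nk->p", err, Z, a * (pre > 0)) / n
    a = a - lr * grad_a
    # Assumed normalization: rescale the effective filter v + w to unit norm.
    u = v + (w - lr * grad_w)
    w = u / np.linalg.norm(u) - v
    return w, a

rng = np.random.default_rng(0)
n, p, k = 512, 4, 3
Z = rng.standard_normal((n, p, k))
v = np.zeros(p)
v[0] = 1.0                                  # unit-norm shortcut (assumption)

# Hypothetical teacher network that produces the training labels.
u_star = rng.standard_normal(p)
u_star /= np.linalg.norm(u_star)
a_star = rng.standard_normal(k)
y = predict(Z, u_star - v, a_star, v)

# Initialization described in the abstract: first layer at 0,
# second layer at an arbitrary point in a ball (radius 1 assumed here).
w = np.zeros(p)
a = rng.standard_normal(k)
a *= 0.5 / np.linalg.norm(a)

initial_loss = 0.5 * np.mean((predict(Z, w, a, v) - y) ** 2)
for _ in range(200):
    w, a = normalized_gd_step(Z, y, w, a, v, lr=0.05)
final_loss = 0.5 * np.mean((predict(Z, w, a, v) - y) ** 2)
print(initial_loss, final_loss)
```

With `w = 0` the first layer reduces to the shortcut `v` alone, which is one way to read the claim that the shortcut connection gives training a good starting point. The sketch only illustrates the update rule. It does not reproduce the paper's analysis or its experiments.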

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Sparse and Compressive Sensing Techniques · Machine Learning and ELM

Methods: Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling · Residual Connection