Demystifying ResNet
Sihan Li, Jiantao Jiao, Yanjun Han, Tsachy Weissman

TL;DR
This paper gives a theoretical explanation for why residual networks whose shortcuts span two layers are easier to train and perform better, and supports it with experiments comparing initialization methods and network depths.
Contribution
It offers a theoretical analysis of the role of shortcut depth in ResNet training difficulty and demonstrates the benefits of depth-two shortcuts through experiments.
Findings
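Shortcuts of depth 2 make the condition number of the Hessian at the zero initial point independent of network depth.
Shortcuts of depth 1 behave like no shortcut at all: the condition number explodes as the network gets deeper.
Small-weight initialization combined with shortcuts of depth 2 improves training outcomes.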
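The depth-invariance finding can be checked numerically on a toy problem. The sketch below is an illustration written for this summary, not the authors' experiment. It assumes a purely linear residual network on 2-dimensional inputs, a least-squares loss, a hypothetical linear target map `A_TARGET`, and a finite-difference Hessian. All function names are made up for the example. It evaluates the Hessian at the all-zero weights and reports the ratio of the largest to the smallest absolute eigenvalue.

```python
import numpy as np

# Hypothetical target map for the toy regression problem (an assumption).
A_TARGET = np.array([[2.0, 0.5], [0.0, 1.5]])

def forward(params, X, depth, shortcut):
    """Linear residual net: each unit maps h to h + W_k ... W_1 h, k = shortcut."""
    d = X.shape[0]
    W = params.reshape(depth, shortcut, d, d)
    H = X
    for unit in range(depth):
        branch = H
        for k in range(shortcut):
            branch = W[unit, k] @ branch
        H = H + branch  # identity shortcut around `shortcut` linear layers
    return H

def loss(params, X, Y, depth, shortcut):
    """Mean squared error (with a factor 1/2) over the columns of X."""
    R = forward(params, X, depth, shortcut) - Y
    return 0.5 * np.sum(R * R) / X.shape[1]

def hessian_at_zero(X, Y, depth, shortcut, h=1e-3):
    """Central finite-difference Hessian of the loss at all-zero weights."""
    n = depth * shortcut * X.shape[0] ** 2
    E = np.eye(n) * h
    f = lambda p: loss(p, X, Y, depth, shortcut)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            num = (f(E[i] + E[j]) - f(E[i] - E[j])
                   - f(E[j] - E[i]) + f(-E[i] - E[j]))
            H[i, j] = H[j, i] = num / (4.0 * h * h)
    return H

def condition_number(H):
    """Largest over smallest absolute eigenvalue of a symmetric matrix."""
    eig = np.abs(np.linalg.eigvalsh(H))
    return eig.max() / eig.min()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((2, 50))
    Y = A_TARGET @ X
    for depth in (1, 2, 4):
        H = hessian_at_zero(X, Y, depth, shortcut=2)
        print(depth, condition_number(H))
```

In this linear toy the Hessian at zero is block diagonal with one identical block per residual unit, so adding units repeats the same eigenvalues and leaves the ratio unchanged. Passing `shortcut=1` to the same functions gives the depth-1 case for comparison.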
Abstract
The Residual Network (ResNet), proposed in He et al. (2015), utilized shortcut connections to significantly reduce the difficulty of training, which resulted in great performance boosts in terms of both training and generalization error. It was empirically observed in He et al. (2015) that stacking more layers of residual blocks with shortcut 2 results in smaller training error, while it is not true for shortcut of length 1 or 3. We provide a theoretical explanation for the uniqueness of shortcut 2. We show that with or without nonlinearities, by adding shortcuts that have depth two, the condition number of the Hessian of the loss function at the zero initial point is depth-invariant, which makes training very deep models no more difficult than shallow ones. Shortcuts of higher depth result in an extremely flat (high-order) stationary point initially, from which the optimization…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning · Advanced Neural Network Applications
