Towards an Understanding of Residual Networks Using Neural Tangent Hierarchy (NTH)
Yuqing Li, Tao Luo, Nung Kwan Yip

TL;DR
This paper analyzes the training dynamics of finite-width deep residual networks using the neural tangent hierarchy, revealing how skip connections contribute to their success over fully-connected networks.
Contribution
The paper extends neural tangent kernel analysis to finite-width ResNets, reducing the required layer width from quartic to cubic in the number of training samples and highlighting the importance of skip connections.
Findings
For ResNets with smooth and Lipschitz activation functions, the layer width required by the analysis drops from quartic to cubic in the number of training samples.
Skip connections are key to ResNet's superior performance.
Analysis suggests ResNet's structure is crucial for its training efficiency.
Abstract
Gradient descent yields zero training loss in polynomial time for deep neural networks despite the non-convex nature of the objective function. The behavior of a network in the infinite-width limit trained by gradient descent can be described by the Neural Tangent Kernel (NTK) introduced in \cite{Jacot2018Neural}. In this paper, we study the dynamics of the NTK for a finite-width Deep Residual Network (ResNet) using the neural tangent hierarchy (NTH) proposed in \cite{Huang2019Dynamics}. For a ResNet with a smooth and Lipschitz activation function, we reduce the requirement on the layer width with respect to the number of training samples from quartic to cubic. Our analysis strongly suggests that the particular skip-connection structure of ResNet is the main reason for its triumph over fully-connected networks.
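To make the object of study concrete, the following is a minimal sketch of the empirical (finite-width) NTK Gram matrix, K[i][j] = <df(x_i)/dtheta, df(x_j)/dtheta>, for a toy residual network. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names (`init_theta`, `forward`, `grad`, `empirical_ntk`), the tanh activation (chosen because it is smooth and Lipschitz), the 1/(depth * sqrt(m)) scaling of the residual branch, the N(0, 1) initialization, and the use of central finite differences in place of exact gradients.

```python
import math
import random


def init_theta(d, m, depth, seed=0):
    """Draw all weights i.i.d. from N(0, 1) and return them as one flat list."""
    rng = random.Random(seed)
    n_params = m * d + depth * m * m + m
    return [rng.gauss(0.0, 1.0) for _ in range(n_params)]


def forward(theta, x, d, m, depth):
    """Scalar output of a toy residual network with tanh activations."""
    pos = 0
    # Input layer: h = tanh(W_in x) / sqrt(m)
    h = []
    for _ in range(m):
        row = theta[pos:pos + d]
        pos += d
        h.append(math.tanh(sum(w * xi for w, xi in zip(row, x))) / math.sqrt(m))
    # Residual blocks: h <- h + tanh(W h) / (depth * sqrt(m))
    scale = 1.0 / (depth * math.sqrt(m))
    for _ in range(depth):
        update = []
        for _ in range(m):
            row = theta[pos:pos + m]
            pos += m
            update.append(scale * math.tanh(sum(w * hi for w, hi in zip(row, h))))
        # The "hi +" term is the skip connection.
        h = [hi + ui for hi, ui in zip(h, update)]
    # Linear readout: f = a . h
    a = theta[pos:pos + m]
    return sum(ai * hi for ai, hi in zip(a, h))


def grad(theta, x, d, m, depth, eps=1e-5):
    """Central-difference estimate of df/dtheta at input x."""
    g = []
    for p in range(len(theta)):
        old = theta[p]
        theta[p] = old + eps
        f_plus = forward(theta, x, d, m, depth)
        theta[p] = old - eps
        f_minus = forward(theta, x, d, m, depth)
        theta[p] = old  # restore the parameter before moving on
        g.append((f_plus - f_minus) / (2.0 * eps))
    return g


def empirical_ntk(theta, xs, d, m, depth):
    """Gram matrix K[i][j] = <df(x_i)/dtheta, df(x_j)/dtheta>."""
    grads = [grad(theta, x, d, m, depth) for x in xs]
    return [[sum(u * v for u, v in zip(gi, gj)) for gj in grads] for gi in grads]


# Tiny example: input dimension 3, width 8, two residual blocks, three samples.
d, m, depth = 3, 8, 2
theta = init_theta(d, m, depth)
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
K = empirical_ntk(theta, xs, d, m, depth)
for row in K:
    print(["%.4f" % v for v in row])
```

In the infinite-width limit this kernel stays fixed during training; at finite width it changes as `theta` is updated, and that evolution is what the neural tangent hierarchy is used to describe. Recomputing `K` after a few gradient steps, or for larger `m`, is one informal way to observe how much the kernel moves.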
Taxonomy
Topics: Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning
Methods: Neural Tangent Kernel · Kaiming Initialization · 1x1 Convolution · Average Pooling · Convolution · Global Average Pooling · Batch Normalization · Residual Connection · Bottleneck Residual Block
