Convergence and Implicit Regularization Properties of Gradient Descent for Deep Residual Networks
Rama Cont, Alain Rossier, RenYuan Xu

TL;DR
This paper proves linear convergence of gradient descent to a global optimum for deep residual networks with constant layer width, characterizes the implicit regularization of the trained weights, and supports the theory with numerical experiments.
Contribution
It proves convergence and regularization properties of gradient descent in deep residual networks, connecting the depth scaling limit to finite p-variation.
Findings
Gradient descent converges linearly to a global optimum.
If the trained weights admit a scaling limit as the depth increases, that limit has finite p-variation with p=2.
Numerical experiments support theoretical results.
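To make the p-variation in the second finding concrete: for a path indexed by layer, the p-variation is the supremum, over all partitions of the index set, of the sum of p-th powers of the increment norms (with a final 1/p root). The sketch below is an illustration only, not code from the paper. It assumes the layer weights are flattened into the rows of an array, and the function name `p_variation_sum` is hypothetical. It computes the supremum exactly for a finite sequence by dynamic programming, before taking the 1/p root.

```python
import numpy as np

def p_variation_sum(path, p):
    """Return sup over partitions of sum ||path[t_{i+1}] - path[t_i]||^p.

    path: array of shape (n,) or (n, d), one row per layer index.
    The supremum over index subsequences is computed in O(n^2) time:
    best[j] is the largest sum over subsequences that start at index 0
    and end at index j.
    """
    x = np.asarray(path, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = x.shape[0]
    best = np.zeros(n)
    for j in range(1, n):
        # p-th power of the distance from every earlier point to point j.
        jumps = np.linalg.norm(x[j] - x[:j], axis=1) ** p
        best[j] = np.max(best[:j] + jumps)
    return best[-1]

# A monotone path: for p=2 the coarsest partition (one big jump) wins.
monotone = p_variation_sum([0.0, 1.0, 2.0, 3.0], p=2)
# An oscillating path: for p=2 the finest partition wins.
oscillating = p_variation_sum([0.0, 1.0, 0.0, 1.0], p=2)
```

For p=1 the finest partition always attains the supremum, so the result reduces to the usual total variation. For p>1 the maximizing partition depends on the path, which is why the dynamic program is needed.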
Abstract
We prove linear convergence of gradient descent to a global optimum for the training of deep residual networks with constant layer width and smooth activation function. We show that if the trained weights, as a function of the layer index, admit a scaling limit as the depth increases, then the limit has finite p-variation with p=2. Proofs are based on non-asymptotic estimates for the loss function and for norms of the network weights along the gradient descent path. We illustrate the relevance of our theoretical results to practical settings using detailed numerical experiments on supervised learning problems.
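The setting described in the abstract can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not the paper's architecture or experiments: the update h_{k+1} = h_k + delta * tanh(W_k h_k), the scaling delta = 1/depth, the squared loss, the initialization, and all function names are illustrative choices. It runs full-batch gradient descent with hand-written backpropagation on a constant-width residual network with a smooth activation.

```python
import numpy as np

def forward(weights, X, delta):
    """Run the residual recursion; return hidden states and activations."""
    hidden, acts = [X], []
    for W in weights:
        a = np.tanh(hidden[-1] @ W.T)          # smooth activation
        acts.append(a)
        hidden.append(hidden[-1] + delta * a)  # residual (skip) update
    return hidden, acts

def loss_and_grads(weights, X, Y, delta):
    """Mean squared loss and its gradient with respect to each W_k."""
    hidden, acts = forward(weights, X, delta)
    n = X.shape[0]
    residual = hidden[-1] - Y
    loss = 0.5 * np.sum(residual ** 2) / n
    G = residual / n                            # dLoss / dh_L
    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        dZ = delta * G * (1.0 - acts[k] ** 2)   # through tanh
        grads[k] = dZ.T @ hidden[k]             # dLoss / dW_k
        G = G + dZ @ weights[k]                 # skip path + branch path
    return loss, grads

def train(X, Y, depth=10, lr=0.5, steps=300, seed=0):
    """Full-batch gradient descent; returns final weights and loss history."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Assumed initialization: small Gaussian weights, same width every layer.
    weights = [0.1 * rng.standard_normal((d, d)) for _ in range(depth)]
    delta = 1.0 / depth                         # assumed depth scaling
    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(weights, X, Y, delta)
        losses.append(loss)
        weights = [W - lr * g for W, g in zip(weights, grads)]
    return weights, losses
```

Calling `train(X, Y)` on a small synthetic regression problem returns the per-layer weights and the loss at each step, which can be plotted on a log scale to inspect the decay rate, or stacked by layer index to study how the trained weights vary with depth.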
Taxonomy
Topics: Numerical methods in inverse problems · Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques
