Convergence and Implicit Regularization Properties of Gradient Descent for Deep Residual Networks

Rama Cont; Alain Rossier; RenYuan Xu

arXiv:2204.07261·cs.LG·January 26, 2023

TL;DR

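This paper proves that gradient descent converges linearly to a global optimum when training deep residual networks with constant layer width and smooth activation function. It also shows that any scaling limit of the trained weights, as depth increases, has finite 2-variation, which acts as an implicit regularization effect. Numerical experiments on supervised learning problems illustrate the results.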

Contribution

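It proves linear convergence of gradient descent for deep residual networks, using non-asymptotic estimates for the loss function and for norms of the network weights along the gradient descent path. It also shows that if the trained weights, as a function of the layer index, admit a scaling limit as the depth increases, then that limit has finite p-variation with p = 2.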

Findings

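01. Gradient descent converges linearly to a global optimum for deep residual networks with constant layer width and smooth activation function.

02. If the trained weights admit a scaling limit as the depth increases, the limit has finite p-variation with p = 2.

03. Numerical experiments on supervised learning problems support the theoretical results.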
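For reference, the two notions used in the findings can be written in their standard textbook form. The notation below (loss $\ell$, iterates $\theta_t$, optimal value $\ell^*$, rate constant $c$, weight path $w$ on $[0,1]$) is assumed here for illustration and is not taken from the paper, whose precise constants and assumptions are not reproduced on this page.

Linear convergence of gradient descent means that, for some constant $c \in (0, 1]$,

$$\ell(\theta_t) - \ell^* \le (1 - c)^t \, \bigl(\ell(\theta_0) - \ell^*\bigr), \qquad t = 0, 1, 2, \dots$$

The $p$-variation of a path $w : [0,1] \to \mathbb{R}^d$ is

$$\|w\|_{p\text{-var}} = \Bigl( \sup_{0 = t_0 < t_1 < \dots < t_n = 1} \; \sum_{i=0}^{n-1} \bigl| w(t_{i+1}) - w(t_i) \bigr|^p \Bigr)^{1/p},$$

where the supremum runs over all finite partitions of $[0,1]$. Finite $p$-variation with $p = 2$ means this quantity is finite when $p = 2$.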

Abstract

We prove linear convergence of gradient descent to a global optimum for the training of deep residual networks with constant layer width and smooth activation function. We show that if the trained weights, as a function of the layer index, admit a scaling limit as the depth increases, then the limit has finite $p$-variation with $p = 2$. Proofs are based on non-asymptotic estimates for the loss function and for norms of the network weights along the gradient descent path. We illustrate the relevance of our theoretical results to practical settings using detailed numerical experiments on supervised learning problems.
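To make the finite $p$-variation statement concrete, here is a minimal sketch of how one might measure the $p$-variation of trained weights viewed as a path over the layer index. The function name `p_variation_pow`, the array layout (one row per layer) and the synthetic path in the example are assumptions for illustration only, not the paper's code or experiments. For a discrete sequence, the supremum over partitions can be computed exactly by dynamic programming over sub-partitions of the layer grid.

```python
import numpy as np


def p_variation_pow(path, p=2.0):
    """Return sup over sub-partitions of sum_i |x[t_{i+1}] - x[t_i]|^p.

    `path` is a sequence of scalars or an array of shape (n_layers, dim),
    one row per layer index (hypothetical layout). The result is the
    p-variation raised to the power p; take `** (1 / p)` for the norm.
    """
    x = np.asarray(path, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat scalars as 1-dimensional vectors
    n = x.shape[0]
    # best[j] = largest sum achievable by a partition ending at index j
    best = np.zeros(n)
    for j in range(1, n):
        # |x[j] - x[i]|^p for every candidate previous partition point i < j
        inc = np.linalg.norm(x[j] - x[:j], axis=1) ** p
        best[j] = np.max(best[:j] + inc)
    return float(best[-1])


if __name__ == "__main__":
    # Synthetic smooth "weight path" over L layers (illustrative only).
    L = 200
    t = np.linspace(0.0, 1.0, L + 1)
    smooth = np.sin(2.0 * np.pi * t)
    print(p_variation_pow(smooth, p=2.0))
```

The dynamic program costs O(n^2) in the number of layers, which is fine for depths in the hundreds or low thousands. Simply summing squared increments between consecutive layers would only give a lower bound for $p > 1$, since coarser partitions can yield a larger sum; for $p = 1$ the two coincide and the function returns the total variation of the sequence.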

Taxonomy

Topics: Numerical methods in inverse problems · Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques