Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks
Peter L. Bartlett, David P. Helmbold, Philip M. Long

TL;DR
This paper studies how deep residual networks with identity initialization learn positive definite linear transformations, providing bounds on convergence and highlighting cases where gradient descent succeeds or fails.
Contribution
It offers theoretical analysis of gradient descent for deep linear networks with identity initialization, especially for positive definite matrices, and identifies conditions for successful learning.
Findings
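Polynomial bounds on the number of gradient descent iterations when the target matrix is symmetric positive definite
Failure of gradient descent with identity initialization when the symmetric target matrix has a negative eigenvalue
Some forms of regularization toward the identity in each layer do not help convergence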
Abstract
We analyze algorithms for approximating a function $f(x) = \Phi x$ mapping $\mathbb{R}^d$ to $\mathbb{R}^d$ using deep linear neural networks, i.e. that learn a function $h$ parameterized by matrices $\Theta_1, \ldots, \Theta_L$ and defined by $h(x) = \Theta_L \Theta_{L-1} \cdots \Theta_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix $\Phi$, in the case where the initial hypothesis $\Theta_1 = \cdots = \Theta_L = I$ has excess loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for $\Phi$ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do…
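The setting in the abstract can be illustrated with a small numerical sketch. It assumes that, for isotropic inputs, the population quadratic loss reduces (up to constants) to half the squared Frobenius distance between the product of the layer matrices and the least squares matrix. The function names `chain` and `train`, the step size, the depth, and the two diagonal target matrices are illustrative choices, not values taken from the paper.

```python
import numpy as np

def chain(mats, d):
    """Return mats[-1] @ ... @ mats[0] (the d x d identity for an empty list)."""
    out = np.eye(d)
    for m in mats:
        out = m @ out
    return out

def train(phi, num_layers=4, lr=0.01, steps=2000):
    """Gradient descent on 0.5 * ||Theta_L ... Theta_1 - Phi||_F^2.

    Every layer starts at the identity. Hyperparameters are assumptions
    chosen for illustration only.
    """
    d = phi.shape[0]
    thetas = [np.eye(d) for _ in range(num_layers)]
    for _ in range(steps):
        residual = chain(thetas, d) - phi
        grads = []
        for i in range(num_layers):
            above = chain(thetas[i + 1:], d)  # product of the layers after layer i
            below = chain(thetas[:i], d)      # product of the layers before layer i
            grads.append(above.T @ residual @ below.T)
        # Update all layers simultaneously from the same iterate.
        thetas = [t - lr * g for t, g in zip(thetas, grads)]
    return 0.5 * np.linalg.norm(chain(thetas, d) - phi, "fro") ** 2

# Symmetric positive definite target: the loss is driven close to zero.
loss_pd = train(np.diag([2.0, 0.5]))

# Symmetric target with a negative eigenvalue: the loss stays bounded away from zero.
loss_neg = train(np.diag([1.0, -1.0]))

print(f"positive definite target: final loss = {loss_pd:.3e}")
print(f"negative eigenvalue target: final loss = {loss_neg:.3e}")
```

With diagonal targets each layer stays diagonal, so every eigen-direction evolves as an independent scalar recursion. In the negative-eigenvalue direction the layer entries shrink toward zero without changing sign, which is why the product cannot reach a negative value in this toy run.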
