Convergence of gradient descent for learning linear neural networks

Gabin Maxime Nguegnang; Holger Rauhut; Ulrich Terstiege

arXiv:2108.02040 · cs.LG · November 25, 2021 · 1 citation

Open Access

TL;DR

This paper analyzes the convergence behavior of gradient descent in training deep linear neural networks, establishing conditions for convergence to critical points and global minima, with insights into the effects of network depth and initialization.

Contribution

It extends a previous analysis of gradient flow for deep linear networks to gradient descent, showing convergence to critical points of the square loss and, depending on the number of layers and the initialization, to global minima.

Findings

01

Gradient descent converges to critical points under suitable step size conditions.

02

For two-layer networks, gradient descent converges to a global minimum for almost all initializations.

03

For three or more layers, gradient descent converges to a global minimum on the manifold of matrices of some fixed rank; that rank cannot be determined a priori.

Abstract

We study the convergence properties of gradient descent for training deep linear neural networks, i.e., deep matrix factorizations, by extending a previous analysis for the related gradient flow. We show that under suitable conditions on the step sizes gradient descent converges to a critical point of the loss function, i.e., the square loss in this article. Furthermore, we demonstrate that for almost all initializations gradient descent converges to a global minimum in the case of two layers. In the case of three or more layers we show that gradient descent converges to a global minimum on the manifold of matrices of some fixed rank, where the rank cannot be determined a priori.
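
The setting described in the abstract can be illustrated numerically. The block below is a minimal NumPy sketch, not the authors' code. It assumes the square loss L(W_1, ..., W_N) = 1/2 ||W_N ... W_1 X - Y||_F^2 and a constant step size; the function names, dimensions, near-identity initialization, and step size are hypothetical choices made only for illustration, and the paper's actual step size conditions are not reproduced here.

```python
import numpy as np

def product(factors):
    """Return W_N ... W_1 for factors = [W_1, ..., W_N]."""
    out = factors[0]
    for W in factors[1:]:
        out = W @ out
    return out

def loss(factors, X, Y):
    """Square loss 1/2 * ||W_N ... W_1 X - Y||_F^2."""
    residual = product(factors) @ X - Y
    return 0.5 * np.sum(residual ** 2)

def gradients(factors, X, Y):
    """Gradient of the square loss with respect to each factor."""
    n = len(factors)
    # Gradient with respect to the end-to-end matrix W = W_N ... W_1.
    G = (product(factors) @ X - Y) @ X.T
    grads = []
    for j in range(n):
        # left: product of the factors applied after factors[j];
        # right: product of the factors applied before it (identity if none).
        left = product(factors[j + 1:]) if j + 1 < n else np.eye(factors[-1].shape[0])
        right = product(factors[:j]) if j > 0 else np.eye(factors[0].shape[1])
        grads.append(left.T @ G @ right.T)
    return grads

def gradient_descent(factors, X, Y, step=1e-2, iters=2000):
    """Plain gradient descent with a constant step size (an assumption)."""
    factors = [W.copy() for W in factors]
    history = [loss(factors, X, Y)]
    for _ in range(iters):
        grads = gradients(factors, X, Y)
        factors = [W - step * g for W, g in zip(factors, grads)]
        history.append(loss(factors, X, Y))
    return factors, history

# Hypothetical toy problem: three 3x3 layers, 20 random data points.
rng = np.random.default_rng(0)
d, m, depth = 3, 20, 3
X = rng.standard_normal((d, m)) / np.sqrt(m)
Y = rng.standard_normal((d, m)) / np.sqrt(m)
init = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(depth)]

final_factors, history = gradient_descent(init, X, Y)
print(f"loss: {history[0]:.4f} -> {history[-1]:.4f}")
```

Tracking the loss in `history` makes it easy to see whether a given step size gives a decreasing sequence; inspecting the singular values of `product(final_factors)` is one way to examine the rank of the limiting end-to-end matrix in the case of three or more layers.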

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Matrix Theory and Algorithms · Sparse and Compressive Sensing Techniques