Gradient descent aligns the layers of deep linear networks

Ziwei Ji; Matus Telgarsky

arXiv:1810.02032 · cs.LG · February 26, 2019 · 22 cites

Open Access

TL;DR

This paper proves that gradient flow and gradient descent on deep linear networks, trained on linearly separable data, drive the risk to zero and align the weight matrices across layers. For the logistic loss, the network's linear function also converges in direction to the maximum margin solution.

Contribution

The paper establishes risk convergence and two implicit regularization effects, layer alignment and convergence in direction to the maximum margin solution, for gradient flow and gradient descent on deep linear networks trained on linearly separable data.

Findings

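01. Under gradient flow with a strictly decreasing loss on linearly separable data, the risk converges to zero.

02. Each normalized weight matrix asymptotically equals its rank-1 approximation, and these rank-1 matrices are aligned across layers.

03. For the logistic loss, the network's linear function (the product of its weight matrices) converges in direction to the maximum margin solution.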

Abstract

This paper establishes risk convergence and asymptotic weight matrix alignment --- a form of implicit regularization --- of gradient flow and gradient descent when applied to deep linear networks on linearly separable data. In more detail, for gradient flow applied to strictly decreasing loss functions (with similar results for gradient descent with particular decreasing step sizes): (i) the risk converges to 0; (ii) the normalized $i$-th weight matrix asymptotically equals its rank-1 approximation $u_i v_i^\top$; (iii) these rank-1 matrices are aligned across layers, meaning $|v_{i+1}^\top u_i| \to 1$. In the case of the logistic loss (binary cross entropy), more can be said: the linear function induced by the network --- the product of its weight matrices --- converges to the same direction as the maximum margin solution. This last property was identified in prior work, but only under…
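
The three properties in the abstract can be observed numerically on a small example. The sketch below is illustrative only and is not the authors' code or experimental setup: the function name `train_deep_linear`, the four-point toy dataset, the fixed small initialization, the depth (three layers) and the constant step size are all assumptions made here for brevity, whereas the paper's gradient descent result uses particular decreasing step sizes. The toy data is symmetric in its second coordinate, so its maximum margin direction through the origin is $(1, 0)$; that reference direction is derived for this dataset only.

```python
import numpy as np


def logistic_risk(w_prod, Z):
    # Mean logistic loss log(1 + exp(-m_i)) over the margins m_i = <w_prod, z_i>.
    return float(np.mean(np.logaddexp(0.0, -(Z @ w_prod.ravel()))))


def train_deep_linear(steps=10000, lr=0.1, scale=0.1):
    # Toy separable data (an assumption); z_i = y_i * x_i folds the label into the input.
    X = np.array([[2.0, 1.0], [2.0, -1.0], [-3.0, -2.0], [-3.0, 2.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    Z = y[:, None] * X

    # Small fixed initialization (hypothetical values) for a 2 -> 3 -> 3 -> 1 network,
    # chosen so that every training point starts with a positive margin.
    W1 = scale * np.array([[1.0, 0.5], [0.2, -0.3], [-0.4, 0.1]])
    W2 = scale * np.array([[1.0, 0.2, -0.1], [0.3, -0.5, 0.2], [-0.2, 0.1, 0.4]])
    W3 = scale * np.array([[1.0, -0.2, 0.3]])

    initial_risk = logistic_risk(W3 @ W2 @ W1, Z)
    for _ in range(steps):
        w_prod = W3 @ W2 @ W1  # 1 x 2 linear predictor induced by the network
        margins = Z @ w_prod.ravel()
        # Gradient of the risk with respect to w_prod: -mean(sigmoid(-m_i) * z_i).
        g = -(Z * (1.0 / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)[None, :]
        # Chain rule through the matrix product W3 @ W2 @ W1.
        g3 = g @ (W2 @ W1).T
        g2 = W3.T @ g @ W1.T
        g1 = (W3 @ W2).T @ g
        # Plain gradient descent with a constant step size (a simplification).
        W1, W2, W3 = W1 - lr * g1, W2 - lr * g2, W3 - lr * g3

    layers = [W1, W2, W3]
    svds = [np.linalg.svd(W) for W in layers]  # each entry is (U, S, Vt)
    # (ii) top singular value over Frobenius norm; equals 1 exactly for a rank-1 matrix.
    rank1_ratio = [float(S[0] / np.linalg.norm(W)) for W, (_, S, _) in zip(layers, svds)]
    # (iii) |v_{i+1}^T u_i| for consecutive layers, using top singular vectors.
    alignment = [float(abs(svds[i + 1][2][0] @ svds[i][0][:, 0])) for i in range(2)]
    # Cosine between the product direction and (1, 0), the max margin direction here.
    w = (W3 @ W2 @ W1).ravel()
    cos_max_margin = float(w[0] / np.linalg.norm(w))

    return {
        "initial_risk": initial_risk,
        "final_risk": logistic_risk(w, Z),
        "rank1_ratio": rank1_ratio,
        "alignment": alignment,
        "cos_max_margin": cos_max_margin,
    }
```

Calling `train_deep_linear()` returns a dictionary of diagnostics. In this toy setting the final risk should be far below the initial one, and the rank-1 ratios, the layer alignments and the cosine with $(1, 0)$ should all be close to 1. The small initialization scale matters for how quickly this becomes visible: the gap between a layer's squared Frobenius norm and its squared top singular value stays roughly at its initial size while the norms grow, so a smaller start makes the normalized matrices look rank-1 sooner.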

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Advanced Neuroimaging Techniques and Applications