Width Provably Matters in Optimization for Deep Linear Neural Networks

Simon S. Du; Wei Hu

arXiv:1901.08572·cs.LG·May 28, 2019·32 cites

TL;DR

This paper proves that when every hidden layer of a deep linear neural network is sufficiently wide, gradient descent with Gaussian random initialization converges to a global minimum at a linear rate, which shows that width matters for optimization.

Contribution

It establishes a width threshold that ensures linear convergence of gradient descent in deep linear networks. The threshold depends on the depth, the rank and condition number of the input data, and the output dimension.

Findings

01

Gradient descent converges linearly with sufficient width.

02

Wide layers are necessary for efficient deep network optimization.

03

Narrow deep linear networks are subject to an $\exp(\Omega(L))$ lower bound on running time [Shamir, 2018].

Abstract

We prove that for an $L$-layer fully-connected linear neural network, if the width of every hidden layer is $\tilde{\Omega}(L \cdot r \cdot d_{out} \cdot \kappa^{3})$, where $r$ and $\kappa$ are the rank and the condition number of the input data, and $d_{out}$ is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an $\epsilon$-suboptimal solution is $O(\kappa \log(\frac{1}{\epsilon}))$. Our polynomial upper bound on the total running time for wide deep linear networks and the $\exp(\Omega(L))$ lower bound for narrow deep linear neural networks [Shamir, 2018] together demonstrate that wide layers are necessary for optimizing deep models.
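
The setting described in the abstract can be illustrated with a minimal NumPy sketch: full-batch gradient descent on the squared loss of a deep linear network with Gaussian initialization. Several details are assumptions that the abstract does not state: the output scaling `alpha`, the step size `eta`, the definition of $\kappa$ as the squared ratio of the largest to the smallest nonzero singular value of the data matrix, and all function names. The width used here is chosen for a quick experiment and is not the theoretical threshold.

```python
import numpy as np


def data_rank_and_condition(X, tol=1e-10):
    """Rank r and condition number kappa of the data matrix X (d_in x n).

    Assumption: kappa is the ratio of the largest to the smallest nonzero
    eigenvalue of X^T X, i.e. the squared ratio of singular values.
    """
    s = np.linalg.svd(X, compute_uv=False)
    r = int(np.sum(s > tol * s[0]))
    kappa = float((s[0] / s[r - 1]) ** 2)
    return r, kappa


def train_deep_linear(X, Y, depth, width, steps, seed=0):
    """Gradient descent on 0.5 * ||alpha * W_L ... W_1 X - Y||_F^2.

    Returns the loss recorded before each update.
    """
    rng = np.random.default_rng(seed)
    d_in, d_out = X.shape[0], Y.shape[0]
    dims = [d_in] + [width] * (depth - 1) + [d_out]
    # Gaussian random initialization: i.i.d. N(0, 1) entries in every layer.
    Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(depth)]
    # Assumed scaling so the network output has roughly the norm of its input.
    alpha = 1.0 / np.sqrt(float(width) ** (depth - 1) * d_out)
    # Assumed step size, proportional to d_out / (depth * ||X||_2^2).
    eta = d_out / (3.0 * depth * np.linalg.norm(X, 2) ** 2)

    losses = []
    for _ in range(steps):
        # Forward pass: acts[i] = W_i ... W_1 X, with acts[0] = X.
        acts = [X]
        for W in Ws:
            acts.append(W @ acts[-1])
        resid = alpha * acts[-1] - Y
        losses.append(0.5 * float(np.sum(resid ** 2)))

        # Backward pass: back = (W_L ... W_{i+1})^T resid for layer i.
        back = resid
        grads = [None] * depth
        for i in reversed(range(depth)):
            grads[i] = alpha * back @ acts[i].T
            back = Ws[i].T @ back

        # Simultaneous update of all layers.
        for W, g in zip(Ws, grads):
            W -= eta * g
    return losses


# Hypothetical toy problem: realizable targets, so the global minimum is zero.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 20))
Y = rng.standard_normal((2, 5)) @ X
r, kappa = data_rank_and_condition(X)
losses = train_deep_linear(X, Y, depth=4, width=256, steps=500)
print(f"rank={r}, kappa={kappa:.2f}, first loss={losses[0]:.3e}, last loss={losses[-1]:.3e}")
```

Plotting `losses` on a logarithmic axis is one way to check whether the decrease looks geometric, which is what a linear convergence rate predicts. Rerunning with a much smaller `width` gives an informal comparison, although a small experiment like this cannot confirm or refute the stated bounds.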

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Face and Expression Recognition