Width Provably Matters in Optimization for Deep Linear Neural Networks
Simon S. Du, Wei Hu

TL;DR
This paper proves that, in a deep linear neural network whose hidden layers are sufficiently wide, gradient descent from Gaussian random initialization converges to a global minimum at a linear rate, which shows that width matters for optimization.
Contribution
It establishes a hidden-layer width threshold that ensures linear convergence of gradient descent in deep linear networks. The threshold depends on the depth, the rank and condition number of the input data, and the output dimension.
Findings
Gradient descent converges linearly with sufficient width.
Wide layers are necessary for efficient deep network optimization.
Narrow deep linear networks are subject to a lower bound on optimization time that is exponential in depth [Shamir, 2018].
Abstract
We prove that for an $L$-layer fully-connected linear neural network, if the width of every hidden layer is $\tilde{\Omega}(L \cdot r \cdot d_{\mathrm{out}} \cdot \kappa^3)$, where $r$ and $\kappa$ are the rank and the condition number of the input data, and $d_{\mathrm{out}}$ is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an $\epsilon$-suboptimal solution is $O(\kappa \log(\frac{1}{\epsilon}))$. Our polynomial upper bound on the total running time for wide deep linear networks and the $\exp(\Omega(L))$ lower bound for narrow deep linear neural networks [Shamir, 2018] together demonstrate that wide layers are necessary for optimizing deep models.
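As a rough illustration of the setting in the abstract, the sketch below runs full-batch gradient descent on a wide deep linear network with Gaussian random initialization and records the squared loss. It is not the paper's exact construction. The function names, the 1/fan-out variance scaling, the step-size heuristic, the synthetic data, and the definition of the condition number as the squared ratio of the largest to the smallest nonzero singular value of the data matrix are all assumptions made for this example.

```python
import numpy as np


def data_rank_and_condition(X, tol=1e-10):
    """Return (rank, condition number) of the data matrix X (columns = samples).

    Assumption: the condition number is the squared ratio of the largest to
    the smallest nonzero singular value of X.
    """
    s = np.linalg.svd(X, compute_uv=False)
    s = s[s > tol * s[0]]
    return int(s.size), float((s[0] / s[-1]) ** 2)


def init_weights(dims, rng):
    """Gaussian init; layer i maps dims[i] -> dims[i + 1].

    Assumption: variance 1 / fan_out, so norms are preserved in expectation.
    """
    return [
        rng.normal(0.0, 1.0 / np.sqrt(n_out), size=(n_out, n_in))
        for n_in, n_out in zip(dims[:-1], dims[1:])
    ]


def forward(weights, X):
    """Return all layer activations, starting with the input X."""
    acts = [X]
    for W in weights:
        acts.append(W @ acts[-1])
    return acts


def squared_loss(weights, X, Y):
    return 0.5 * float(np.sum((forward(weights, X)[-1] - Y) ** 2))


def gd_step(weights, X, Y, lr):
    """One full-batch gradient descent step on the squared loss."""
    acts = forward(weights, X)
    delta = acts[-1] - Y  # gradient of the loss w.r.t. the network output
    grads = [None] * len(weights)
    for i in reversed(range(len(weights))):
        grads[i] = delta @ acts[i].T  # gradient w.r.t. weights[i]
        delta = weights[i].T @ delta  # backpropagate to the layer input
    return [W - lr * g for W, g in zip(weights, grads)]


def run_demo(depth=4, width=256, d_in=5, d_out=2, n=20, steps=300, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(d_in, n))
    # Realizable targets, so the globally optimal loss is 0.
    Y = rng.normal(size=(d_out, d_in)) @ X
    r, kappa = data_rank_and_condition(X)
    dims = [d_in] + [width] * (depth - 1) + [d_out]
    weights = init_weights(dims, rng)
    # Heuristic step size for this parameterization (an assumption,
    # not the constant used in the paper's analysis).
    lr = d_out / (depth * width * np.linalg.norm(X, 2) ** 2)
    losses = [squared_loss(weights, X, Y)]
    for _ in range(steps):
        weights = gd_step(weights, X, Y, lr)
        losses.append(squared_loss(weights, X, Y))
    return r, kappa, losses


if __name__ == "__main__":
    r, kappa, losses = run_demo()
    print(f"rank={r}, kappa={kappa:.2f}")
    print(f"initial loss={losses[0]:.3e}, final loss={losses[-1]:.3e}")
```

Plotting `np.log(losses)` against the iteration index should give a roughly straight line if the convergence is linear (geometric). Reducing `width` or increasing `depth` in `run_demo` is a simple way to probe how the behavior changes, although a small experiment like this cannot confirm the theoretical bounds.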
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Face and Expression Recognition
