Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

Wei Hu; Lechao Xiao; Jeffrey Pennington

arXiv:2001.05992 · cs.LG · January 17, 2020 · 34 citations

TL;DR

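This paper proves that, for deep linear networks, initializing the weights with orthogonal matrices speeds up convergence relative to standard Gaussian initialization: the width needed for efficient convergence is independent of depth under orthogonal initialization, whereas under Gaussian initialization it must grow linearly with depth. The result gives a theoretical basis for a common empirical practice.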

Contribution

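It offers the first rigorous proof that orthogonal initialization speeds up convergence in deep linear networks relative to Gaussian initialization: with orthogonal initialization, the width required for efficient convergence to a global minimum does not depend on depth.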

Findings

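01 Orthogonal initialization speeds up convergence relative to Gaussian initialization.

02 With orthogonal initialization, the width needed for efficient convergence is independent of depth.

03 With Gaussian initialization, the width needed for efficient convergence scales linearly with depth.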

Abstract

The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian…
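The contrast between the two initialization schemes can be illustrated numerically. The following is a minimal sketch, not the paper's construction: it assumes square layers of equal width, Gaussian entries with variance 1/width, and orthogonal matrices sampled via a sign-corrected QR decomposition. The function names, width, depth, and seed are all hypothetical choices made for illustration. It multiplies the freshly initialized layers together and inspects the singular values of the resulting end-to-end matrix.

```python
import numpy as np

def orthogonal_init(width, rng):
    """Sample a width x width orthogonal matrix (illustrative helper)."""
    a = rng.standard_normal((width, width))
    q, r = np.linalg.qr(a)
    # Flip column signs so the draw is uniform over the orthogonal group.
    return q * np.sign(np.diag(r))

def gaussian_init(width, rng):
    """Sample iid Gaussian weights with variance 1/width (assumed scaling)."""
    return rng.standard_normal((width, width)) / np.sqrt(width)

def product_singular_values(init_fn, width, depth, seed=0):
    """Singular values of W_depth @ ... @ W_1 at initialization."""
    rng = np.random.default_rng(seed)
    prod = np.eye(width)
    for _ in range(depth):
        prod = init_fn(width, rng) @ prod
    return np.linalg.svd(prod, compute_uv=False)

sv_orth = product_singular_values(orthogonal_init, width=32, depth=20)
sv_gauss = product_singular_values(gaussian_init, width=32, depth=20)

print("orthogonal: max", sv_orth.max(), "min", sv_orth.min())
print("gaussian:   max", sv_gauss.max(), "min", sv_gauss.min())
```

A product of orthogonal matrices is itself orthogonal, so every singular value of the end-to-end map stays at 1 regardless of depth, while the singular values of the Gaussian product spread apart as layers are added. This only illustrates the conditioning at initialization; it does not reproduce the paper's convergence analysis or its width bounds.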

Taxonomy

Topics: Speech Recognition and Synthesis · Neural Networks and Applications · Gaussian Processes and Bayesian Inference