Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice
Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli

TL;DR
This paper uses free probability theory to analyze how weight initialization and nonlinearities affect the singular value distribution of Jacobians in deep networks, revealing that sigmoidal networks can achieve dynamical isometry and learn faster than ReLU networks.
Contribution
It extends the concept of dynamical isometry to deep nonlinear networks using free probability, showing sigmoidal networks can achieve it with orthogonal initialization and improve learning speed.
Findings
ReLU networks cannot achieve dynamical isometry.
Sigmoidal networks can achieve isometry with orthogonal initialization.
Dynamically isometric networks learn significantly faster.
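For reference, the quantities behind these findings can be written out as a short sketch. The notation here is an assumption, since the summary does not define it: a fully connected network of depth $L$ and width $N$ with pre-activations $h^{l} = W^{l} x^{l-1} + b^{l}$, activations $x^{l} = \phi(h^{l})$, and singular values $s_i$ of the Jacobian.

```latex
% Input-output Jacobian of the network (notation assumed, see lead-in)
J \;=\; \frac{\partial x^{L}}{\partial x^{0}}
  \;=\; \prod_{l=1}^{L} D^{l} W^{l},
\qquad
D^{l}_{ij} \;=\; \phi'\!\left(h^{l}_{i}\right)\delta_{ij}

% Weaker condition: mean squared singular value stays O(1) in depth
\frac{1}{N}\,\operatorname{tr}\!\left(J J^{\top}\right)
  \;=\; \frac{1}{N}\sum_{i=1}^{N} s_i^{2} \;=\; O(1)

% Dynamical isometry: every singular value is close to 1
s_i(J) \;\approx\; 1 \quad \text{for all } i = 1,\dots,N
```

In this form, the initialization enters through the $W^{l}$ factors and the nonlinearity through the diagonal $D^{l}$ factors, which is why both choices shape the spectrum of $J$.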
Abstract
It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is O(1) is essential for avoiding the exponential vanishing or explosion of gradients. The stronger condition that all singular values of the Jacobian concentrate near 1 is a property known as dynamical isometry. For deep linear networks, dynamical isometry can be achieved through orthogonal weight initialization and has been shown to dramatically speed up learning; however, it has remained unclear how to extend these results to the nonlinear setting. We address this question by employing powerful tools from free probability theory to compute analytically the entire singular value distribution of a deep network's input-output Jacobian. We explore the dependence of the…
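The effect described in the abstract can be checked numerically. The following is a minimal sketch, not the authors' code: it builds a bias-free tanh network, accumulates the input-output Jacobian layer by layer, and compares its singular values under orthogonal and Gaussian initialization. The width, depth, input scale, and all function names are illustrative assumptions.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Fix column signs so the factorization is unique; q stays orthogonal.
    return q * np.sign(np.diag(r))


def random_gaussian(n, rng):
    """Gaussian matrix with entry variance 1/n (same scale as the orthogonal case)."""
    return rng.standard_normal((n, n)) / np.sqrt(n)


def jacobian_singular_values(init, depth=20, width=100, input_scale=0.03, seed=0):
    """Singular values of the input-output Jacobian of a bias-free tanh network.

    All hyperparameters are assumptions chosen for illustration: a small
    input scale keeps tanh close to its linear regime.
    """
    rng = np.random.default_rng(seed)
    x = input_scale * rng.standard_normal(width)
    jac = np.eye(width)
    for _ in range(depth):
        w = init(width, rng)
        x = np.tanh(w @ x)
        d = 1.0 - x**2  # tanh'(h) evaluated at the pre-activation h = w @ x_prev
        jac = (d[:, None] * w) @ jac  # left-multiply by D^l W^l
    return np.linalg.svd(jac, compute_uv=False)


sv_orth = jacobian_singular_values(random_orthogonal)
sv_gauss = jacobian_singular_values(random_gaussian)

for name, sv in [("orthogonal", sv_orth), ("gaussian", sv_gauss)]:
    print(f"{name:>10}: min={sv.min():.3e}  max={sv.max():.3e}  "
          f"mean square={np.mean(sv**2):.3e}")
```

Under these assumptions, the orthogonal case should keep every singular value close to 1, while the Gaussian case can have a comparable mean squared singular value and still spread its singular values over many orders of magnitude. This is the gap between the O(1) condition and dynamical isometry.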
Taxonomy
Topics: Model Reduction and Neural Networks · Neural Networks and Applications · Gaussian Processes and Bayesian Inference
