Which Neural Net Architectures Give Rise To Exploding and Vanishing Gradients?
Boris Hanin

TL;DR
This paper rigorously analyzes how the architecture of a randomly initialized fully connected ReLU network shapes the statistics of its gradients, showing that the hidden layer widths determine when gradients explode or vanish at initialization.
Contribution
It provides a rigorous statistical analysis of gradient behavior in randomly initialized fully connected ReLU networks, complementing mean field theory with rigorously computed finite width corrections.
Findings
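The empirical variance of the squared entries of the input-output Jacobian is exponential in an architecture-dependent constant beta, the sum of the reciprocals of the hidden layer widths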
Large beta causes gradients to vary wildly at initialization
Finite width corrections to the statistics of gradients are rigorously computed at the edge of chaos
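To make the constant beta concrete, here is a minimal sketch that evaluates it for a few architectures, using the abstract's definition of beta as the sum of the reciprocals of the hidden layer widths. The function name `beta` and the example architectures are illustrative choices, not taken from the paper.

```python
def beta(hidden_widths):
    """Sum of the reciprocals of the hidden layer widths."""
    if not hidden_widths or any(n <= 0 for n in hidden_widths):
        raise ValueError("hidden_widths must be a non-empty list of positive ints")
    return sum(1.0 / n for n in hidden_widths)


# Hypothetical architectures, all with 50 hidden layers.
deep_narrow = [10] * 50          # beta is about 5.0
deep_wide = [500] * 50           # beta is about 0.1
bottleneck = [500] * 49 + [2]    # a single width-2 layer contributes 0.5 on its own

for name, widths in [("deep_narrow", deep_narrow),
                     ("deep_wide", deep_wide),
                     ("bottleneck", bottleneck)]:
    print(f"{name}: beta = {beta(widths):.3f}")
```

Because beta is a sum over layers, depth and width trade off against each other: a deep network keeps beta small only if its layers are wide, and one very narrow layer can dominate the sum.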
Abstract
We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in a simple architecture-dependent constant beta, given by the sum of the reciprocals of the hidden layer widths. When beta is large, the gradients computed by N at initialization vary wildly. Our approach complements the mean field theory analysis of random networks. From this point of view, we rigorously compute finite width corrections to the statistics of gradients at the edge of chaos.
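The quantity studied in the abstract can be estimated numerically. The sketch below is a small Monte Carlo experiment, not the paper's method: it samples random ReLU networks, computes one entry of the input-output Jacobian per network, and reports the empirical variance of the squared entries divided by the square of their mean. Several choices here are assumptions rather than details from the abstract: Gaussian weights with variance 2/fan_in, zero biases, a linear output layer, a standard Gaussian input, and the helper names themselves.

```python
import random


def sample_jacobian_entry(layer_sizes, rng):
    """Draw one random network and return d(output_0)/d(input_0).

    layer_sizes = [n_in, hidden_1, ..., hidden_d, n_out].
    Assumed setup: weights ~ N(0, 2/fan_in), zero biases, linear final layer.
    """
    act = [rng.gauss(0.0, 1.0) for _ in range(layer_sizes[0])]
    # Derivative of the current activations with respect to input coordinate 0.
    dact = [1.0] + [0.0] * (layer_sizes[0] - 1)
    last = len(layer_sizes) - 2  # index of the final (linear) layer
    for k, (fan_in, fan_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        std = (2.0 / fan_in) ** 0.5
        new_act, new_dact = [], []
        for _ in range(fan_out):
            w = [rng.gauss(0.0, std) for _ in range(fan_in)]
            pre = sum(wi * a for wi, a in zip(w, act))
            dpre = sum(wi * d for wi, d in zip(w, dact))
            if k < last and pre <= 0.0:
                # ReLU is inactive: both the value and its derivative are zero.
                pre, dpre = 0.0, 0.0
            new_act.append(pre)
            new_dact.append(dpre)
        act, dact = new_act, new_dact
    return dact[0]


def normalized_variance_of_squares(layer_sizes, trials, seed=0):
    """Empirical Var(Z^2) / Mean(Z^2)^2 over independently drawn networks."""
    rng = random.Random(seed)
    sq = [sample_jacobian_entry(layer_sizes, rng) ** 2 for _ in range(trials)]
    mean = sum(sq) / trials
    if mean == 0.0:
        return float("nan")  # every sampled network had a zero Jacobian entry
    var = sum((s - mean) ** 2 for s in sq) / trials
    return var / mean ** 2


# Two hypothetical architectures with 6 hidden layers each.
narrow = [4] + [3] * 6 + [1]   # beta = 6 * (1/3) = 2.0
wide = [4] + [16] * 6 + [1]    # beta = 6 * (1/16) = 0.375
for name, sizes in [("narrow", narrow), ("wide", wide)]:
    print(name, normalized_variance_of_squares(sizes, trials=500))
```

Under these assumptions, one would expect the narrow network (larger beta) to show a noticeably larger normalized variance than the wide one, although the estimate is heavy-tailed and noisy at this sample size, so individual runs can vary.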
Taxonomy
Topics: Neural Networks and Applications · Neural dynamics and brain function · stochastic dynamics and bifurcation
