The exploding gradient problem demystified - definition, prevalence, impact, origin, tradeoffs, and solutions
George Philipp, Dawn Song, Jaime G. Carbonell

TL;DR
This paper challenges the common belief that techniques like Adam and batch normalization fully solve the exploding gradient problem. It shows that the problem persists in many architectures and that residual networks with skip connections can mitigate it, enabling the training of deeper networks.
Contribution
The paper provides a detailed analysis of the exploding gradient problem, introduces the residual trick, and explains why residual networks better handle exploding gradients, facilitating deeper neural network training.
Findings
Exploding gradients exist in many popular MLP architectures.
ResNets have significantly lower gradients than comparable non-residual networks, which allows much deeper networks to be trained effectively.
The residual trick simplifies understanding why skip connections help.
Abstract
Whereas it is believed that techniques such as Adam, batch normalization and, more recently, SeLU nonlinearities "solve" the exploding gradient problem, we show that this is not the case in general and that in a range of popular MLP architectures, exploding gradients exist and that they limit the depth to which networks can be effectively trained, both in theory and in practice. We explain why exploding gradients occur and highlight the *collapsing domain problem*, which can arise in architectures that avoid exploding gradients. ResNets have significantly lower gradients and thus can circumvent the exploding gradient problem, enabling the effective training of much deeper networks. We show this is a direct consequence of the Pythagorean equation. By noticing that *any neural network is a residual network*, we devise the *residual trick*, which reveals that introducing skip connections…
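The observation that *any neural network is a residual network* can be illustrated with a small sketch. It assumes the idea amounts to rewriting a layer f(x) as x + (f(x) − x), which requires the layer's input and output to have the same width; the truncated abstract does not spell out the details, so this is one reading rather than the authors' formulation. The function names `layer`, `residual_part`, and `layer_as_residual` are hypothetical, as is the choice of a tanh layer.

```python
import numpy as np

def layer(x, W, b):
    """An ordinary fully connected layer with a tanh nonlinearity."""
    return np.tanh(W @ x + b)

def residual_part(x, W, b):
    """What the layer does beyond copying its input: f(x) - x."""
    return layer(x, W, b) - x

def layer_as_residual(x, W, b):
    """The same layer written as a skip connection (x) plus a residual."""
    return x + residual_part(x, W, b)

# Illustrative setup: a square layer so that f(x) - x is well defined.
rng = np.random.default_rng(0)
width = 8
W = rng.standard_normal((width, width)) / np.sqrt(width)
b = np.zeros(width)
x = rng.standard_normal(width)

# Both forms compute the same function, up to floating-point rounding.
same = np.allclose(layer(x, W, b), layer_as_residual(x, W, b))
print("identical outputs:", same)
```

Under this reading, the rewrite changes nothing about what the network computes. It only separates each layer into an identity path and a residual path, which is the decomposition that a ResNet makes explicit in its architecture.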
Taxonomy
Topics: Neural Networks and Applications · Advanced Neural Network Applications · Brain Tumor Detection and Classification
Methods: Average Pooling · 1x1 Convolution · Residual Connection · Max Pooling · Global Average Pooling · Bottleneck Residual Block · Residual Block · Kaiming Initialization · Convolution
