Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs
Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, Jeffrey Pennington

TL;DR
This paper develops a mean field theory for LSTMs and GRUs to understand signal propagation, leading to a new initialization scheme that improves training stability and performance on long sequence tasks.
Contribution
The authors introduce a mean field theory for LSTMs and GRUs, deriving an initialization scheme that reduces training instabilities and enhances generalization.
Findings
New initialization scheme improves training stability on long sequences
Scheme enables successful training where standard methods fail
Models trained with the proposed initialization generalize better
Abstract
Training recurrent neural networks (RNNs) on long sequence tasks is plagued with difficulties arising from the exponential explosion or vanishing of signals as they propagate forward or backward through the network. Many techniques have been proposed to ameliorate these issues, including various algorithmic and architectural modifications. Two of the most successful RNN architectures, the LSTM and the GRU, do exhibit modest improvements over vanilla RNN cells, but they still suffer from instabilities when trained on very long sequences. In this work, we develop a mean field theory of signal propagation in LSTMs and GRUs that enables us to calculate the time scales for signal propagation as well as the spectral properties of the state-to-state Jacobians. By optimizing these quantities in terms of the initialization hyperparameters, we derive a novel initialization scheme that eliminates…
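To make "spectral properties of the state-to-state Jacobians" concrete, the following is a minimal NumPy sketch that builds a randomly initialized GRU cell, estimates the Jacobian of the new hidden state with respect to the previous one by central finite differences, and returns its singular values. Dynamical isometry corresponds to these singular values concentrating near 1. Everything specific here is an assumption made for illustration and is not taken from the paper: the GRU update convention, the zero-input setting, the Gaussian weights with variance sigma_w^2 / n, the constant bias mu_b, and all function names. It is not the initialization scheme the authors derive.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def init_gru(n, sigma_w, mu_b, rng):
    # Hypothetical initialization: recurrent weights ~ N(0, sigma_w^2 / n),
    # biases set to the constant mu_b, for each of the three gates.
    return {
        gate: (rng.normal(0.0, sigma_w / np.sqrt(n), size=(n, n)), np.full(n, mu_b))
        for gate in ("z", "r", "h")
    }


def gru_step(params, h):
    # One GRU update with zero input (an assumption that isolates the
    # state-to-state map). Convention: h_new = (1 - z) * h + z * h_tilde.
    Uz, bz = params["z"]
    Ur, br = params["r"]
    Uh, bh = params["h"]
    z = sigmoid(Uz @ h + bz)              # update gate
    r = sigmoid(Ur @ h + br)              # reset gate
    h_tilde = np.tanh(Uh @ (r * h) + bh)  # candidate state
    return (1.0 - z) * h + z * h_tilde


def state_jacobian(params, h, eps=1e-6):
    # Central finite-difference estimate of d h_new / d h, one column at a time.
    n = h.size
    J = np.empty((n, n))
    for j in range(n):
        d = np.zeros(n)
        d[j] = eps
        J[:, j] = (gru_step(params, h + d) - gru_step(params, h - d)) / (2.0 * eps)
    return J


def jacobian_singular_values(n=64, sigma_w=1.0, mu_b=0.0, seed=0):
    # Singular values of the state-to-state Jacobian at a random hidden state.
    rng = np.random.default_rng(seed)
    params = init_gru(n, sigma_w, mu_b, rng)
    h = rng.normal(size=n)
    J = state_jacobian(params, h)
    return np.linalg.svd(J, compute_uv=False)


if __name__ == "__main__":
    s = jacobian_singular_values()
    print("largest and smallest singular value:", s[0], s[-1])
```

Sweeping sigma_w and mu_b and watching how the singular values spread is a rough empirical counterpart to the quantities the abstract says the theory computes. A sanity check of the sketch: with sigma_w = 0 and mu_b = 0 the update gate is 0.5 everywhere and the candidate state is 0, so the Jacobian is 0.5 times the identity.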
Taxonomy
Topics: Neural Networks and Applications · Neural Networks and Reservoir Computing · Model Reduction and Neural Networks
Methods: Sigmoid Activation · Tanh Activation · Gated Recurrent Unit · Long Short-Term Memory
