ReZero is All You Need: Fast Convergence at Large Depth
Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, Julian McAuley

TL;DR
This paper introduces ReZero, a simple gating mechanism for residual networks: each residual connection is gated by a single zero-initialized parameter. The gate satisfies dynamical isometry at initialization, which yields fast convergence and improved performance in deep neural networks.
Contribution
The paper demonstrates that gating each residual connection with a single zero-initialized parameter satisfies dynamical isometry at initialization, enabling training of much deeper networks with faster convergence.
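The gating idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ReZeroBlock`, `linear_relu`, and `forward_stack` are hypothetical, and the choice of a dense layer with ReLU for the residual function F is only for demonstration. Each block computes x + alpha * F(x) with alpha starting at zero, so a freshly initialized stack of any depth passes its input through unchanged.

```python
import random


def linear_relu(x, weights):
    """F(x): a square dense layer followed by ReLU (illustrative choice of F)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


class ReZeroBlock:
    """Residual block gated by one scalar: x + alpha * F(x). Hypothetical name."""

    def __init__(self, width, rng):
        # Random weights for F; the gate below makes their scale irrelevant at init.
        self.weights = [[rng.gauss(0.0, 1.0) for _ in range(width)]
                        for _ in range(width)]
        # Zero-initialized gate. In real training this would be a learned parameter.
        self.alpha = 0.0

    def forward(self, x):
        fx = linear_relu(x, self.weights)
        return [xi + self.alpha * fi for xi, fi in zip(x, fx)]


def forward_stack(blocks, x):
    """Apply the blocks in order."""
    for block in blocks:
        x = block.forward(x)
    return x


rng = random.Random(0)
blocks = [ReZeroBlock(4, rng) for _ in range(1000)]
x0 = [1.0, -2.0, 0.5, 3.0]
out = forward_stack(blocks, x0)  # equals x0 while every alpha is still 0.0
```

Once training moves a gate away from zero, the corresponding block starts contributing alpha * F(x) to the signal; the sketch omits the training loop entirely.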
Findings
ReZero enables training thousands of layers in fully connected networks.
ReZero accelerates convergence by 56% in 12-layer Transformers.
ReZero improves test performance for ResNets trained on CIFAR-10.
Abstract
Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al. used free probability theory to show that dynamical isometry plays an integral role in efficient deep learning. We show that the simplest architecture change of gating each residual connection using a single zero-initialized parameter satisfies initial dynamical isometry and outperforms more complex approaches. Although much simpler than its predecessors, this gate enables training thousands of fully connected layers with fast convergence and better test performance for ResNets trained on CIFAR-10. We apply this technique to language modeling and find…
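The claim about initial dynamical isometry can be checked with a short derivation. The notation here is assumed for illustration, since this page does not define it: x_i is the input to layer i, F is the residual function, alpha_i is the gate of layer i, and L is the depth. The ordering of the matrix product is left implicit.

```latex
% Gated residual update with a zero-initialized scalar gate
x_{i+1} = x_i + \alpha_i \, F(x_i), \qquad \alpha_i = 0 \ \text{at initialization}

% Input-output Jacobian of an L-layer stack (chain rule)
J_{io} = \frac{\partial x_L}{\partial x_0}
       = \prod_{i=0}^{L-1} \left( I + \alpha_i \, \frac{\partial F(x_i)}{\partial x_i} \right)

% At initialization every factor reduces to the identity
\alpha_i = 0 \ \ \forall i \quad \Longrightarrow \quad J_{io} = I
```

Every singular value of the identity matrix equals 1, so signals and gradients are neither amplified nor attenuated at initialization, regardless of depth.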
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Advanced Neural Network Applications
Methods: Linear Layer · ReZero · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam
