ReZero is All You Need: Fast Convergence at Large Depth

Thomas Bachlechner; Bodhisattwa Prasad Majumder; Huanru Henry Mao; Garrison W. Cottrell; Julian McAuley

arXiv:2003.04887 · cs.LG · June 26, 2020 · 151 citations

TL;DR

This paper introduces ReZero, a simple gate on each residual connection that consists of a single zero-initialized parameter. The gate satisfies dynamical isometry at initialization, which yields fast convergence and improved test performance in deep networks.

Contribution

The paper demonstrates that gating each residual connection with a single zero-initialized parameter satisfies dynamical isometry at initialization, which enables much deeper networks to train with faster convergence.
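
The gating idea described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the class name `ReZeroBlock`, the helper `forward_stack`, and the dense-plus-tanh sublayer are hypothetical choices made only for illustration. The one element taken from the text is the single gate per residual connection, initialized to zero, so that a stack of any depth starts out as the identity map.

```python
import math
import random


class ReZeroBlock:
    """Residual block computing x + alpha * F(x) with one scalar gate alpha."""

    def __init__(self, width, rng):
        # Hypothetical sublayer F: a dense layer followed by tanh.
        scale = 1.0 / math.sqrt(width)
        self.weights = [[rng.gauss(0.0, scale) for _ in range(width)]
                        for _ in range(width)]
        # The gate is a single parameter initialized to exactly zero.
        self.alpha = 0.0

    def sublayer(self, x):
        # F(x) = tanh(W x), computed row by row.
        return [math.tanh(sum(w * v for w, v in zip(row, x)))
                for row in self.weights]

    def forward(self, x):
        fx = self.sublayer(x)
        # Gated residual connection: the skip path is left untouched.
        return [xi + self.alpha * fi for xi, fi in zip(x, fx)]


def forward_stack(blocks, x):
    """Apply the blocks in sequence."""
    for block in blocks:
        x = block.forward(x)
    return x


rng = random.Random(0)
blocks = [ReZeroBlock(width=8, rng=rng) for _ in range(1000)]
x = [rng.gauss(0.0, 1.0) for _ in range(8)]
y = forward_stack(blocks, x)
print(y == x)  # True: with every gate at zero, 1000 blocks act as the identity
```

In a real training setup each `alpha` would be a trainable parameter updated by the optimizer together with the sublayer weights; the sketch covers only the state at initialization.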

Findings

01. ReZero enables training fully connected networks with thousands of layers.

02. ReZero accelerates convergence by 56% in 12-layer Transformers.

03. ReZero improves test performance for ResNets trained on CIFAR-10.

Abstract

Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al. used free probability theory to show that dynamical isometry plays an integral role in efficient deep learning. We show that the simplest architecture change of gating each residual connection using a single zero-initialized parameter satisfies initial dynamical isometry and outperforms more complex approaches. Although much simpler than its predecessors, this gate enables training thousands of fully connected layers with fast convergence and better test performance for ResNets trained on CIFAR-10. We apply this technique to language modeling and find…
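
The link between the zero-initialized gate and dynamical isometry can be spelled out in a short derivation. The notation is an assumption made for this sketch and does not appear on this page: $x_i$ is the input to layer $i$, $F$ is the layer's transformation, $\alpha_i$ is its gate, $L$ is the depth, and $J_{io}$ is the input-output Jacobian.

```latex
\begin{align}
  x_{i+1} &= x_i + \alpha_i F(x_i), \qquad \alpha_i = 0 \ \text{at initialization} \\
  \frac{\partial x_{i+1}}{\partial x_i}
    &= I + \alpha_i \frac{\partial F(x_i)}{\partial x_i}
     = I \quad \text{when } \alpha_i = 0 \\
  J_{io} = \frac{\partial x_L}{\partial x_0}
    &= \prod_{i=0}^{L-1} \frac{\partial x_{i+1}}{\partial x_i}
     = I
\end{align}
```

Every singular value of the identity is 1, so under these assumptions the input-output Jacobian is perfectly conditioned at initialization regardless of the depth $L$, which is the sense in which the gate satisfies initial dynamical isometry.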

Code & Models

Models

chrisjob1021/cnn-prelu-imagenet · model · 9 downloads

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Advanced Neural Network Applications

Methods: Linear Layer · ReZero · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam