Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks

Soham De; Samuel L. Smith

arXiv:2002.10444 · cs.LG · December 10, 2020 · 63 cites

TL;DR

This paper explains how batch normalization enables training very deep residual networks by biasing residual blocks towards the identity function at initialization, and introduces an alternative initialization scheme that removes the need for normalization.

Contribution

The authors reveal the mechanism by which batch normalization facilitates training deep residual networks and propose a simple initialization method to train such networks without normalization.

Findings

01. Batch normalization biases residual blocks towards the identity function at initialization.

02. Deep residual networks can be trained without normalization using a new initialization scheme.

03. The benefit of the larger learning rates that batch normalization permits is limited to specific compute regimes.
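The second finding can be illustrated with a toy residual stack. This is a minimal sketch of how a SkipInit-style block might look, not the authors' implementation: it assumes the scheme multiplies each residual branch by a learnable scalar `alpha` that starts at zero, so every block computes the identity at initialization. The helper names (`residual_branch`, `skipinit_block`), the linear-plus-ReLU branch, and the width and depth values are all hypothetical choices made for illustration.

```python
import random


def residual_branch(x, weights):
    """Hypothetical stand-in for a residual branch: linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def skipinit_block(x, weights, alpha):
    """Unnormalized residual block: skip path plus alpha times the branch output."""
    fx = residual_branch(x, weights)
    return [xi + alpha * fi for xi, fi in zip(x, fx)]


random.seed(0)
width, depth = 8, 50  # illustrative sizes, not taken from the paper

x = [random.gauss(0.0, 1.0) for _ in range(width)]

# One random weight matrix per block (He-style scale, an assumption).
blocks = [
    [[random.gauss(0.0, (2.0 / width) ** 0.5) for _ in range(width)]
     for _ in range(width)]
    for _ in range(depth)
]

# Assumed initial value of the per-block scalar; in training it would be learned.
alphas = [0.0] * depth

h = x
for weights, alpha in zip(blocks, alphas):
    h = skipinit_block(h, weights, alpha)

# With every alpha at zero, the whole stack passes its input through unchanged.
print(h == x)
```

Setting `alphas` to a nonzero value such as `1.0` in the same script lets the branch outputs accumulate across the 50 blocks, which is the regime that normalization, or a small initial `alpha`, is meant to avoid.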

Abstract

Batch normalization dramatically increases the largest trainable depth of residual networks, and this benefit has been crucial to the empirical success of deep residual networks on a wide range of benchmarks. We show that this key benefit arises because, at initialization, batch normalization downscales the residual branch relative to the skip connection, by a normalizing factor on the order of the square root of the network depth. This ensures that, early in training, the function computed by normalized residual blocks in deep networks is close to the identity function (on average). We use this insight to develop a simple initialization scheme that can train deep residual networks without normalization. We also provide a detailed empirical study of residual networks, which clarifies that, although batch normalized networks can be trained with larger learning rates, this effect is only…
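The square-root-of-depth factor described in the abstract can be reproduced with a simple variance recursion. The sketch below is an idealized model under stated assumptions, not the paper's derivation: a normalized branch is taken to output unit variance regardless of its input, an unnormalized branch is taken to preserve its input variance, and branch and skip path are treated as uncorrelated so that their variances add. The function names are hypothetical.

```python
import math


def branch_to_skip_ratios(depth):
    """Ratio std(branch) / std(skip) at each normalized residual block.

    Assumes each normalized branch emits unit variance and is uncorrelated
    with the skip path, so the skip-path variance grows by 1 per block.
    """
    var = 1.0  # variance of the input to the first block
    ratios = []
    for _ in range(depth):
        # Branch std is 1 after normalization; skip std is sqrt(var).
        ratios.append(1.0 / math.sqrt(var))
        var += 1.0
    return ratios


def unnormalized_variance(depth):
    """Skip-path variance after `depth` blocks without normalization.

    Assumes the branch preserves its input variance, so each block doubles it.
    """
    var = 1.0
    for _ in range(depth):
        var += var
    return var


ratios = branch_to_skip_ratios(100)
print(ratios[0])                  # first block: branch and skip on equal footing
print(ratios[-1])                 # 100th block: branch is 1/sqrt(100) of the skip
print(unnormalized_variance(10))  # 2**10 growth after only ten blocks
```

Under these assumptions the branch at block l contributes a fraction 1/sqrt(l) of the skip path's standard deviation, so deeper blocks sit progressively closer to the identity, while the unnormalized recursion grows exponentially with depth.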

Videos

Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks · SlidesLive

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning

Methods: SkipInit · Batch Normalization