Understanding Batch Normalization
Johan Bjorck, Carla Gomes, Bart Selman, Kilian Q. Weinberger

TL;DR
This paper empirically investigates how batch normalization improves deep neural network training by enabling larger learning rates, stabilizing activations, and preventing divergence, thus leading to faster convergence and better generalization.
Contribution
It provides a detailed empirical analysis showing that BN's main benefit is that it permits larger learning rates, which the authors identify as the cause of faster convergence and better generalization.
Findings
BN enables training with larger learning rates.
BN stabilizes activations to prevent divergence.
Large gradient updates cause instability in unnormalized networks.
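The findings above can be illustrated with a toy experiment. The sketch below is not the paper's setup: it assumes a stack of random dense + ReLU layers whose weights are deliberately scaled larger than a scale-preserving initialization (a hypothetical stand-in for the effect of an overly large update), and it compares the final activation magnitude with and without per-feature normalization over the batch. The function names, layer sizes, and `weight_scale` parameter are illustrative assumptions.

```python
import math
import random


def normalize_features(batch, eps=1e-5):
    """Shift and scale each feature (column) to zero mean and roughly unit variance over the batch."""
    n = len(batch)
    normalized_columns = []
    for column in zip(*batch):
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        normalized_columns.append([(x - mean) / math.sqrt(var + eps) for x in column])
    return [list(row) for row in zip(*normalized_columns)]


def final_activation_scale(depth, width, weight_scale, use_bn, seed=0):
    """Return the mean activation after `depth` random dense + ReLU layers."""
    rng = random.Random(seed)
    batch = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(16)]
    # weight_scale = 1 roughly preserves the activation scale under ReLU;
    # larger values make each layer amplify its input (toy assumption).
    std = weight_scale * math.sqrt(2.0 / width)
    for _ in range(depth):
        # One weight vector per output unit.
        weights = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        pre = [[sum(x * w for x, w in zip(row, unit)) for unit in weights] for row in batch]
        if use_bn:
            pre = normalize_features(pre)  # normalize pre-activations before the ReLU
        batch = [[max(0.0, z) for z in row] for row in pre]
    # Activations are non-negative after ReLU, so the mean equals the mean magnitude.
    return sum(x for row in batch for x in row) / (len(batch) * width)


without_bn = final_activation_scale(depth=10, width=32, weight_scale=2.0, use_bn=False)
with_bn = final_activation_scale(depth=10, width=32, weight_scale=2.0, use_bn=True)
print(f"mean activation without BN: {without_bn:.1f}")
print(f"mean activation with BN:    {with_bn:.3f}")
```

In this toy setting the unnormalized activations grow roughly geometrically with depth, while the normalized ones stay bounded regardless of the weight scale, because each layer's pre-activations are rescaled before being passed on.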
Abstract
Batch normalization (BN) is a technique to normalize activations in intermediate layers of deep neural networks. Its tendency to improve accuracy and speed up training has established BN as a favorite technique in deep learning. Yet, despite its enormous success, there remains little consensus on the exact reason and mechanism behind these improvements. In this paper we take a step towards a better understanding of BN, following an empirical approach. We conduct several experiments, and show that BN primarily enables training with larger learning rates, which is the cause of faster convergence and better generalization. For networks without BN we demonstrate how large gradient updates can result in diverging loss and activations growing uncontrollably with network depth, which limits possible learning rates. BN avoids this problem by constantly correcting activations to be zero-mean…
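For readers unfamiliar with the operation the abstract refers to, here is a minimal sketch of the standard training-mode BN transform: each feature is normalized over the mini-batch using the batch mean and (biased) batch variance, then scaled and shifted. It is written in plain Python for clarity rather than speed. The scalar `gamma` and `beta` shared across features are a simplifying assumption (in practice they are learned per feature), and the function name and example values are hypothetical.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) of `batch` over the mini-batch.

    batch: list of rows, one row of floats per example.
    gamma, beta: scale and shift; scalars here for brevity (assumption).
    eps: small constant that keeps the division stable when the variance is tiny.
    """
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n  # biased batch variance
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = gamma * (column[i] - mean) * inv_std + beta
    return out


# Two features on very different scales end up on the same scale.
activations = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = batch_norm(activations)
for row in normalized:
    print([round(v, 3) for v in row])
```

With the default `gamma=1.0` and `beta=0.0`, every output column has zero mean and approximately unit variance over the batch, independent of the scale of the incoming activations.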
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Domain Adaptation and Few-Shot Learning
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · 1-Dimensional Convolutional Neural Networks
