Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs
Jonathan Frankle, David J. Schwab, and Ari S. Morcos

TL;DR
Training only the affine (scale and shift) parameters of BatchNorm in a CNN, with every other weight frozen at its random initialization, still yields surprisingly high accuracy, which shows how much expressive power these parameters carry.
Contribution
The paper shows that training only the BatchNorm affine parameters, with all other weights frozen at their random initializations, achieves surprisingly high accuracy. This isolates the expressive capacity of affine transformations applied to random features.
Findings
Sufficiently deep ResNets reach 82% accuracy on CIFAR-10 and 32% top-5 accuracy on ImageNet when only the BatchNorm affine parameters are trained.
Trained this way, networks learn to disable roughly a third of their random features.
Training the affine parameters alone outperforms training an equal number of randomly chosen parameters elsewhere in the network.
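The findings above rest on two bookkeeping steps: deciding which parameters are trainable, and measuring how many features end up switched off. The sketch below illustrates both in plain Python. The parameter table, the `.gamma`/`.beta` naming convention, the example scale values, and the 0.01 threshold are all assumptions made for illustration; none of them come from this summary.

```python
from math import prod

# Hypothetical parameter table for a tiny network: name -> shape.
params = {
    "conv1.weight": (16, 3, 3, 3),
    "bn1.gamma": (16,),   # BatchNorm scale
    "bn1.beta": (16,),    # BatchNorm shift
    "fc.weight": (10, 16),
    "fc.bias": (10,),
}

def is_batchnorm_affine(name):
    """Assumed naming convention: BatchNorm affine params end in .gamma or .beta."""
    return name.endswith(".gamma") or name.endswith(".beta")

def count(table):
    """Total number of scalar parameters in a name -> shape table."""
    return sum(prod(shape) for shape in table.values())

# Only the BatchNorm affine parameters are trained; everything else stays frozen.
trainable = {n: s for n, s in params.items() if is_batchnorm_affine(n)}
frozen = {n: s for n, s in params.items() if not is_batchnorm_affine(n)}

def fraction_disabled(gammas, threshold=0.01):
    """Share of features whose scale is close enough to zero to count as disabled.

    The threshold is an illustrative assumption.
    """
    return sum(abs(g) < threshold for g in gammas) / len(gammas)

example_gammas = [0.0, 0.5, -0.003, 1.2, 0.9, -0.7]  # made-up values
print(count(trainable), count(frozen))    # 32 trainable vs 602 frozen
print(fraction_disabled(example_gammas))
```

Even in this toy table the trainable parameters are a small minority, which is the regime the findings describe.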
Abstract
A wide variety of deep learning techniques from style transfer to multitask learning rely on training affine transformations of features. Most prominent among these is the popular feature normalization technique BatchNorm, which normalizes activations and then subsequently applies a learned affine transform. In this paper, we aim to understand the role and expressive power of affine parameters used to transform features in this way. To isolate the contribution of these parameters from that of the learned features they transform, we investigate the performance achieved when training only these parameters in BatchNorm and freezing all weights at their random initializations. Doing so leads to surprisingly high performance considering the significant limitations that this style of training imposes. For example, sufficiently deep ResNets reach 82% (CIFAR-10) and 32% (ImageNet, top-5)…
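As a rough illustration of the "normalize, then apply a learned affine transform" step the abstract describes, here is a minimal single-feature sketch in plain Python. The function names, the `eps` value, and the sample activations are assumptions for illustration, and the code uses the batch (biased) variance. It is not the authors' implementation.

```python
import math

def batchnorm_affine(batch, gamma, beta, eps=1e-5):
    """Normalize one feature over a batch, then apply y = gamma * x_hat + beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

def relu(values):
    """Elementwise ReLU."""
    return [max(0.0, v) for v in values]

# Hypothetical outputs of one frozen, randomly initialized filter.
activations = [0.5, -1.2, 3.0, 0.1]

# A nonzero scale keeps the feature active after normalization.
scaled = batchnorm_affine(activations, gamma=2.0, beta=0.0)

# A zero scale with a zero shift maps every input to 0, so the feature is
# disabled once ReLU is applied.
disabled = relu(batchnorm_affine(activations, gamma=0.0, beta=0.0))
```

With the convolution weights frozen, `gamma` and `beta` are the only knobs left per feature. Setting `gamma` near zero is one way a network can switch a random feature off.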
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications
Methods: Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling · Residual Connection
