Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift
Xiang Li, Shuo Chen, Xiaolin Hu, Jian Yang

TL;DR
This paper investigates why combining Dropout and Batch Normalization often degrades performance, revealing a variance shift issue that causes instability, and proposes strategies to mitigate this problem based on theoretical and empirical analysis.
Contribution
The paper uncovers the variance shift mechanism between Dropout and Batch Normalization, explaining their incompatibility, and proposes methods to improve their combined effectiveness.
Findings
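When Dropout is applied before BN, the variance BN accumulated during training no longer matches the variance of its inputs at test time; this variance shift makes inference numerically unstable.
Experiments on DenseNet, ResNet, ResNeXt and Wide ResNet confirm that the variance shift leads to more erroneous predictions.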
Modified Dropout strategies can mitigate the variance shift issue.
Abstract
This paper first answers the question "why do the two most powerful techniques Dropout and Batch Normalization (BN) often lead to worse performance when they are combined?" in both theoretical and statistical aspects. Theoretically, we find that Dropout shifts the variance of a specific neural unit when we transfer the state of that network from training to test. However, BN maintains its statistical variance, which is accumulated from the entire learning procedure, in the test phase. The inconsistency of that variance (which we name "variance shift") causes unstable numerical behavior in inference that ultimately leads to more erroneous predictions when Dropout is applied before BN. Thorough experiments on DenseNet, ResNet, ResNeXt and Wide ResNet confirm our findings. According to the uncovered mechanism, we next explore several strategies that modify…
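The mechanism described in the abstract can be illustrated with a small numerical sketch. It is not the authors' code: it assumes a zero-mean, unit-variance input, inverted Dropout (kept units rescaled by 1/keep_prob during training, identity at test time), and a BN layer whose stored variance has fully converged to the training-time variance. The function name and parameters are hypothetical.

```python
import numpy as np


def dropout_variance_shift(keep_prob=0.5, n_samples=200_000, seed=0):
    """Compare the variance a BN layer would see in training vs. at test time.

    Assumptions (illustrative only): x ~ N(0, 1), inverted Dropout,
    and a BN running variance equal to the training-time variance.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)

    # Training: Dropout zeroes units with probability 1 - keep_prob
    # and rescales the survivors by 1 / keep_prob.
    mask = rng.random(n_samples) < keep_prob
    train_out = x * mask / keep_prob

    # Test: Dropout is the identity, so the following BN layer sees x itself.
    test_out = x

    train_var = train_out.var()  # roughly 1 / keep_prob for this input
    test_var = test_out.var()    # roughly 1 for this input

    # BN normalizes test inputs with the variance stored during training,
    # so the normalized test activations have variance test_var / train_var
    # rather than the value of 1 the next layer was trained on.
    ratio = test_var / train_var
    return train_var, test_var, ratio


if __name__ == "__main__":
    train_var, test_var, ratio = dropout_variance_shift(keep_prob=0.5)
    print(f"train variance: {train_var:.3f}")
    print(f"test variance:  {test_var:.3f}")
    print(f"test / train:   {ratio:.3f}")
```

Under these assumptions the ratio comes out close to keep_prob, so a stronger Dropout (smaller keep_prob) gives a larger mismatch between the variance BN stored and the variance it actually receives at inference.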
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
Methods: Average Pooling · ResNeXt Block · Grouped Convolution · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling · 1x1 Convolution
