On the Training Instability of Shuffling SGD with Batch Normalization

David X. Wu; Chulhee Yun; Suvrit Sra

arXiv:2302.12444 · cs.LG · August 15, 2023

TL;DR

This paper investigates how two SGD variants, Single Shuffle (SS) and Random Reshuffle (RR), interact with batch normalization. It finds that RR yields a more stable training loss than SS and, in the paper's explicit constructions, avoids the divergence and distorted optima that SS can exhibit.

Contribution

The paper provides a theoretical and empirical analysis of how SS and RR behave differently under batch normalization, highlighting RR's stability advantages.

Findings

01

RR leads to more stable training loss evolution than SS.

02

SS can lead to distorted optima in regression and to divergence in classification.

03

Empirical validation confirms the theoretical differences in practical settings.

Abstract

We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR) -- two widely used variants of SGD -- interact surprisingly differently in the presence of batch normalization: RR leads to much more stable evolution of training loss than SS. As a concrete example, for regression using a linear network with batch normalization, we prove that SS and RR converge to distinct global optima that are "distorted" away from gradient descent. Thereafter, for classification we characterize conditions under which training divergence for SS and RR can and cannot occur. We present explicit constructions to show how SS leads to distorted optima in regression and divergence for classification, whereas RR avoids both distortion and divergence. We validate our results by…
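
The difference between the two shuffling schemes named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `epoch_batches` and `distinct_batches` and all parameter values are hypothetical, chosen only for illustration. The sketch shows that SS fixes one permutation and therefore revisits the same mini-batches every epoch, while RR draws a new permutation each epoch and so usually changes which examples share a batch. Since batch normalization computes statistics per mini-batch, batch composition is the quantity that differs between the two schemes.

```python
import random


def epoch_batches(n, batch_size, epochs, scheme, seed=0):
    """Return, for each epoch, the list of mini-batches as tuples of example indices.

    scheme="SS": Single Shuffle, one permutation drawn up front and reused.
    scheme="RR": Random Reshuffle, a fresh permutation drawn every epoch.
    All names and defaults here are illustrative assumptions.
    """
    if scheme not in ("SS", "RR"):
        raise ValueError("scheme must be 'SS' or 'RR'")
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)  # initial permutation; SS keeps it for all epochs
    schedule = []
    for _ in range(epochs):
        if scheme == "RR":
            rng.shuffle(perm)  # RR reshuffles before every epoch
        batches = [tuple(perm[i:i + batch_size]) for i in range(0, n, batch_size)]
        schedule.append(batches)
    return schedule


def distinct_batches(schedule):
    """Set of distinct batch compositions (order within a batch ignored)."""
    return {frozenset(batch) for epoch in schedule for batch in epoch}


ss = epoch_batches(n=8, batch_size=2, epochs=5, scheme="SS")
rr = epoch_batches(n=8, batch_size=2, epochs=5, scheme="RR")
print("distinct batches under SS:", len(distinct_batches(ss)))
print("distinct batches under RR:", len(distinct_batches(rr)))
```

Under SS the number of distinct batches always equals the number of batches per epoch, so the batch-normalization statistics cycle through a fixed set. Under RR the count is at least as large and typically grows with the number of epochs.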

Videos

On the Training Instability of Shuffling SGD with Batch Normalization · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Statistical Mechanics and Entropy · Face and Expression Recognition

Methods: Batch Normalization · Stochastic Gradient Descent