Disentangling the Mechanisms Behind Implicit Regularization in SGD
Zachary Novack, Simran Kaur, Tanya Marwah, Saurabh Garg, Zachary C. Lipton

TL;DR
This paper empirically investigates why small-batch SGD generalizes better than large-batch SGD, focusing on the role of implicit regularization and on how well different explicit regularizers recover that benefit across datasets.
Contribution
It provides the first extensive empirical evaluation of competing hypotheses about implicit regularization in SGD, highlighting the effectiveness of explicit gradient norm and Fisher information regularization.
Findings
Explicit regularization of gradient norm and Fisher trace recovers small-batch generalization.
Jacobian regularizations do not replicate small-batch benefits.
Regularization effects vary across datasets such as CIFAR10 and CIFAR100.
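The first finding can be made concrete with a small sketch. The abstract describes micro-batches as disjoint smaller subsets of each mini-batch, with the gradient norm penalty averaged over them. The code below illustrates that idea on a toy linear model with squared loss. The model, the use of the squared norm, the penalty weight lam, and all function names are assumptions made for illustration; this is not the authors' implementation.

```python
# Illustrative sketch only: toy linear model with squared loss.
# All names here are hypothetical and not taken from the paper's code.

def micro_batches(batch, micro_size):
    """Split a mini-batch into disjoint, consecutive micro-batches."""
    return [batch[i:i + micro_size] for i in range(0, len(batch), micro_size)]


def residual(w, x, y):
    """Prediction error w.x - y for a single example."""
    return sum(wi * xi for wi, xi in zip(w, x)) - y


def mean_loss(w, examples):
    """Mean of 0.5 * (w.x - y)^2 over the given examples."""
    return sum(0.5 * residual(w, x, y) ** 2 for x, y in examples) / len(examples)


def mean_gradient(w, examples):
    """Gradient of mean_loss with respect to w (analytic for this toy model)."""
    grad = [0.0] * len(w)
    for x, y in examples:
        r = residual(w, x, y)
        for j, xj in enumerate(x):
            grad[j] += r * xj / len(examples)
    return grad


def grad_norm_penalty(w, batch, micro_size):
    """Squared gradient norm, averaged over micro-batches (squared norm assumed)."""
    parts = micro_batches(batch, micro_size)
    norms = [sum(g * g for g in mean_gradient(w, part)) for part in parts]
    return sum(norms) / len(parts)


def regularized_loss(w, batch, micro_size, lam):
    """Mini-batch loss plus lam times the micro-batch gradient norm penalty."""
    return mean_loss(w, batch) + lam * grad_norm_penalty(w, batch, micro_size)


# Toy mini-batch of (features, target) pairs.
batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 0.0), ([2.0, 0.0], 2.0)]
w = [0.0, 0.0]
objective = regularized_loss(w, batch, micro_size=2, lam=0.1)
```

Setting micro_size equal to the mini-batch size reduces the penalty to the squared norm of the ordinary mini-batch gradient. Smaller micro-batches give a penalty at least as large, since the average of squared norms is never smaller than the squared norm of the average. In a real training loop the penalty would itself be differentiated, which requires second-order automatic differentiation that this sketch does not show.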
Abstract
A number of competing hypotheses have been proposed to explain why small-batch Stochastic Gradient Descent (SGD) leads to improved generalization over the full-batch regime, with recent work crediting the implicit regularization of various quantities throughout training. However, to date, empirical evidence assessing the explanatory power of these hypotheses is lacking. In this paper, we conduct an extensive empirical evaluation, focusing on the ability of various theorized mechanisms to close the small-to-large batch generalization gap. Additionally, we characterize how the quantities that SGD has been claimed to (implicitly) regularize change over the course of training. By using micro-batches, i.e., disjoint smaller subsets of each mini-batch, we empirically show that explicitly penalizing the gradient norm or the Fisher Information Matrix trace, averaged over micro-batches, in the…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques
Methods: fail · Test · Stochastic Gradient Descent
