SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs
Satyen Kale, Ayush Sekhari, Karthik Sridharan

TL;DR
This paper studies SGD in stochastic convex optimization, challenges the implicit-regularization explanation of its success, and shows that running multiple epochs can improve learning outcomes, with implications for deep learning.
Contribution
It provides theoretical separations between SGD and other methods, shows that multi-epoch SGD can outperform single-pass SGD, and refutes implicit regularization as the sole explanation for SGD's success.
Findings
Implicit regularization does not always explain SGD's success.
Multiple epochs can significantly improve learning in certain problems.
SGD can outperform regularized empirical risk minimization in some settings.
Abstract
Multi-epoch, small-batch, Stochastic Gradient Descent (SGD) has been the method of choice for learning with large over-parameterized models. A popular theory for explaining why SGD works well in practice is that the algorithm has an implicit regularization that biases its output towards a good solution. Perhaps the theoretically most well understood learning setting for SGD is that of Stochastic Convex Optimization (SCO), where it is well known that SGD learns at a rate of $O(1/\sqrt{n})$, where $n$ is the number of samples. In this paper, we consider the problem of SCO and explore the role of implicit regularization, batch size and multiple epochs for SGD. Our main contributions are threefold: (a) We show that for any regularizer, there is an SCO problem for which Regularized Empirical Risk Minimization fails to learn. This automatically rules out any implicit regularization based…
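The single-pass SGD baseline mentioned in the abstract can be sketched on a toy problem. The example below is an illustration only, not a construction from the paper: it assumes a one-dimensional loss f(w; z) = |w - z| (convex and 1-Lipschitz), samples z drawn uniformly from a hypothetical interval [LOW, HIGH], a projection radius of 1, and the common step size radius / sqrt(n). The names `one_pass_sgd`, `population_risk`, and `excess_risk` are made up for this sketch.

```python
import math
import random

# Hypothetical toy distribution: z ~ Uniform(LOW, HIGH).
LOW, HIGH = -0.2, 0.8
MEDIAN = (LOW + HIGH) / 2  # minimizer of E|w - z| for a uniform z


def one_pass_sgd(samples, radius=1.0):
    """Single-pass projected SGD on f(w; z) = |w - z|, returning the averaged iterate."""
    n = len(samples)
    eta = radius / math.sqrt(n)  # assumed step size for a 1-Lipschitz loss
    w = 0.0
    avg = 0.0
    for t, z in enumerate(samples, start=1):
        avg += (w - avg) / t                        # running mean of w_1, ..., w_t
        g = (w > z) - (w < z)                       # a subgradient of |w - z| at w
        w = max(-radius, min(radius, w - eta * g))  # gradient step, then projection
    return avg


def population_risk(w):
    """Closed form of E|w - z| for z ~ Uniform(LOW, HIGH)."""
    if w <= LOW or w >= HIGH:
        return abs(w - MEDIAN)  # all mass on one side, and mean equals median here
    return ((w - LOW) ** 2 + (HIGH - w) ** 2) / (2 * (HIGH - LOW))


def excess_risk(w):
    """Gap between the population risk at w and the best achievable risk."""
    return population_risk(w) - population_risk(MEDIAN)


if __name__ == "__main__":
    random.seed(0)
    for n in (100, 10_000):
        samples = [random.uniform(LOW, HIGH) for _ in range(n)]
        print(n, excess_risk(one_pass_sgd(samples)))
```

Each sample is used exactly once, so the averaged iterate is evaluated on the population risk rather than on the training set. The paper's separation results rely on specially constructed high-dimensional problems, which this toy setting does not reproduce.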
Taxonomy
Topics: Advanced Bandit Algorithms Research · Advanced Multi-Objective Optimization Algorithms · Model Reduction and Neural Networks
Methods: Stochastic Gradient Descent
