SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs

Satyen Kale; Ayush Sekhari; Karthik Sridharan

arXiv:2107.05074·cs.LG·July 13, 2021

TL;DR

This paper studies SGD in stochastic convex optimization, challenges the implicit-regularization explanation of its success, and shows that running multiple epochs can improve learning outcomes, with implications for deep learning.

Contribution

It provides theoretical separations between SGD and other methods, shows that multi-epoch SGD can outperform single-pass SGD, and argues against implicit regularization as the sole explanation for SGD's success.

Findings

01. Implicit regularization does not always explain SGD's success.

02. Multiple epochs can significantly improve learning in certain problems.

03. SGD can outperform regularized empirical risk minimization in some settings.

Abstract

Multi-epoch, small-batch, Stochastic Gradient Descent (SGD) has been the method of choice for learning with large over-parameterized models. A popular theory for explaining why SGD works well in practice is that the algorithm has an implicit regularization that biases its output towards a good solution. Perhaps the theoretically most well understood learning setting for SGD is that of Stochastic Convex Optimization (SCO), where it is well known that SGD learns at a rate of $O(1/\sqrt{n})$, where $n$ is the number of samples. In this paper, we consider the problem of SCO and explore the role of implicit regularization, batch size and multiple epochs for SGD. Our main contributions are threefold: (a) We show that for any regularizer, there is an SCO problem for which Regularized Empirical Risk Minimization fails to learn. This automatically rules out any implicit regularization based…
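To make the three knobs named in the abstract concrete (number of epochs, batch size, and a single pass over $n$ samples), here is a minimal sketch of mini-batch SGD on a convex least-squares objective. It is an illustration only and not one of the paper's constructions: the synthetic data, the step size `lr`, and the helper names `make_data`, `mean_sq_loss`, and `sgd` are all assumptions introduced for this example.

```python
import random


def make_data(n, w_true, noise=0.1, seed=0):
    """Draw n samples (x, y) with y = <w_true, x> + Gaussian noise (hypothetical data)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in w_true]
        y = sum(wi * xi for wi, xi in zip(w_true, x)) + rng.gauss(0.0, noise)
        data.append((x, y))
    return data


def mean_sq_loss(w, data):
    """Average of 0.5 * (<w, x> - y)^2 over the data: a convex objective in w."""
    total = 0.0
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        total += 0.5 * err * err
    return total / len(data)


def sgd(data, epochs=1, batch_size=1, lr=0.05, seed=0):
    """Mini-batch SGD. epochs=1 with batch_size=1 is single-pass SGD."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)  # fresh random order for each pass over the data
        for start in range(0, len(order), batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            grad = [0.0] * dim
            for x, y in batch:
                err = sum(wi * xi for wi, xi in zip(w, x)) - y
                for j in range(dim):
                    grad[j] += err * x[j] / len(batch)  # averaged batch gradient
            w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w


W_TRUE = [1.0, -2.0, 0.5]  # assumed ground truth, for illustration only
train = make_data(200, W_TRUE, seed=1)
test = make_data(1000, W_TRUE, seed=2)

w_single = sgd(train, epochs=1, batch_size=1)   # one pass, one sample per step
w_multi = sgd(train, epochs=20, batch_size=8)   # multiple epochs, small batches

print("single-pass test loss:", mean_sq_loss(w_single, test))
print("multi-epoch test loss:", mean_sq_loss(w_multi, test))
```

On an easy problem like this one both variants fit well, so the sketch should not be read as evidence for or against the paper's separation results; it only shows where the epoch count and batch size enter the algorithm.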

Taxonomy

Topics: Advanced Bandit Algorithms Research · Advanced Multi-Objective Optimization Algorithms · Model Reduction and Neural Networks

Methods: Stochastic Gradient Descent