On the Trajectories of SGD Without Replacement

Pierfrancesco Beneventano

arXiv:2312.16143 · cs.LG · April 23, 2024 · 1 citation

Open Access

TL;DR

This paper analyzes SGD without replacement and shows that it behaves like SGD with an additional regularization step, which helps it escape saddles faster and encourages sparsity in the Hessian spectrum, with implications for training stability.

Contribution

The paper introduces a theoretical framework showing that SGD without replacement is locally equivalent to taking an additional step on a novel regularizer, which helps explain its efficiency and the spectral properties observed when training neural networks.

Findings

01. SGD without replacement accelerates saddle escape.

02. It regularizes the trace of the noise covariance.

03. It encourages sparsity in the Hessian spectrum.

Abstract

This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD). We consider the case of SGD without replacement, the variant typically used to optimize large-scale neural networks. We analyze this algorithm in a more realistic regime than typically considered in theoretical works on SGD: for example, we allow the product of the learning rate and the Hessian to be $O(1)$, and we do not specify any model architecture, learning task, or loss (objective) function. Our core theoretical result is that optimizing with SGD without replacement is locally equivalent to making an additional step on a novel regularizer. This implies that the expected trajectories of SGD without replacement can be decoupled into (i) following SGD with replacement (in which batches are sampled i.i.d.) along the directions of high curvature, and (ii) regularizing the trace of the noise covariance…
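
The distinction the abstract draws between the two sampling schemes can be illustrated with a minimal sketch. It is not taken from the paper: the function names are hypothetical, and it assumes that "with replacement" means each batch is an independent uniform draw of distinct indices, so batches are i.i.d. and an example may appear in several batches of the same epoch or in none.

```python
import random


def epoch_batches_without_replacement(n, batch_size, rng):
    """Shuffle the dataset indices once, then partition them into batches.

    Every example is visited exactly once per epoch, so the batches within
    an epoch are statistically dependent on one another.
    """
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n, batch_size)]


def epoch_batches_with_replacement(n, batch_size, rng):
    """Draw each batch independently of all the others (i.i.d. batches).

    Assumption for this sketch: indices are distinct within a batch, but an
    example may recur across batches or be skipped during the epoch.
    """
    num_batches = (n + batch_size - 1) // batch_size  # same step count per epoch
    return [rng.sample(range(n), batch_size) for _ in range(num_batches)]


if __name__ == "__main__":
    rng = random.Random(0)
    print("without replacement:", epoch_batches_without_replacement(10, 3, rng))
    print("with replacement:   ", epoch_batches_with_replacement(10, 3, rng))
```

In the first scheme the union of an epoch's batches is exactly the dataset. That dependence between batches is the structural difference behind the paper's comparison of the two trajectories.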

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Blind Source Separation Techniques

Methods: Stochastic Gradient Descent