On the Trajectories of SGD Without Replacement
Pierfrancesco Beneventano

TL;DR
This paper analyzes SGD without replacement, revealing it acts like an additional regularizer that helps escape saddles faster and encourages sparsity in the Hessian spectrum, with implications for training stability.
Contribution
It introduces a theoretical framework showing that SGD without replacement is locally equivalent to taking an additional step on a regularizer, which helps explain its efficiency and spectral properties when training neural networks.
Findings
SGD without replacement accelerates saddle escape.
It regularizes the trace of the noise covariance.
It encourages sparsity in the Hessian spectrum.
Abstract
This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD). We consider the case of SGD without replacement, the variant typically used to optimize large-scale neural networks. We analyze this algorithm in a more realistic regime than typically considered in theoretical works on SGD, as, e.g., we allow the product of the learning rate and Hessian to be O(1) and we do not specify any model architecture, learning task, or loss (objective) function. Our core theoretical result is that optimizing with SGD without replacement is locally equivalent to making an additional step on a novel regularizer. This implies that the expected trajectories of SGD without replacement can be decoupled into (i) following SGD with replacement (in which batches are sampled i.i.d.) along the directions of high curvature, and (ii) regularizing the trace of the noise covariance…
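To make the distinction in the abstract concrete, here is a minimal sketch of the two batch-sampling schemes. It is not taken from the paper: the function names, the least-squares loss, and the choice to draw each i.i.d. batch as a uniform subset are all assumptions made for illustration only.

```python
import numpy as np

def batches_without_replacement(n, batch_size, rng):
    """One epoch of random reshuffling: permute all indices once,
    then split the permutation into disjoint consecutive batches."""
    perm = rng.permutation(n)
    return [perm[i:i + batch_size] for i in range(0, n, batch_size)]

def batches_with_replacement(n, batch_size, rng):
    """Same number of batches, but each batch is drawn independently
    of the others, so an example may appear in several batches
    (or in none) within one epoch."""
    num_batches = -(-n // batch_size)  # ceiling division
    return [rng.choice(n, size=batch_size, replace=False)
            for _ in range(num_batches)]

def sgd_epoch(w, X, y, lr, batches):
    """Run plain SGD over the given batches on an illustrative
    least-squares loss 0.5 * mean((X w - y)^2)."""
    for idx in batches:
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / len(idx)
        w = w - lr * grad
    return w

# Hypothetical toy data, used only to exercise the functions above.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
y = X @ np.array([1.0, -2.0])

w_rr = sgd_epoch(np.zeros(2), X, y, 0.1,
                 batches_without_replacement(12, 4, rng))
w_iid = sgd_epoch(np.zeros(2), X, y, 0.1,
                  batches_with_replacement(12, 4, rng))
```

In the without-replacement scheme every example is visited exactly once per epoch, so the batches within an epoch are statistically dependent. That dependence is the structural difference the abstract's comparison rests on.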
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Blind Source Separation Techniques
Methods: Stochastic Gradient Descent
