Why is parameter averaging beneficial in SGD? An objective smoothing perspective
Atsushi Nitanda, Ryuhei Kikuchi, Shugo Maeda, Denny Wu

TL;DR
This paper investigates why parameter averaging in SGD improves generalization. Using theoretical proofs and experimental validation, it shows that averaged SGD effectively optimizes a smoothed objective that avoids sharp local minima.
Contribution
It provides a theoretical explanation for the benefits of averaged SGD through an objective smoothing perspective, supported by empirical results.
Findings
In certain problem settings, averaged SGD efficiently optimizes a smoothed objective that avoids sharp local minima.
Parameter averaging with an appropriate step size improves SGD performance.
Experimental results confirm the theoretical predictions.
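The parameter averaging referred to in the findings above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `averaged_sgd`, the Gaussian gradient noise, the uniform average over all iterates, and the quadratic toy objective f(w) = w²/2 are all assumptions chosen for brevity.

```python
import numpy as np

def averaged_sgd(grad, w0, step_size, noise_std, n_steps, seed=0):
    """Run SGD with additive Gaussian gradient noise (an assumption) and
    keep a running uniform average of the iterates."""
    rng = np.random.default_rng(seed)
    w = float(w0)
    w_avg = 0.0
    iterates = []
    for t in range(1, n_steps + 1):
        # Stochastic gradient: true gradient plus zero-mean noise.
        noisy_grad = grad(w) + noise_std * rng.standard_normal()
        w = w - step_size * noisy_grad
        # Incremental update of the mean of the iterates seen so far.
        w_avg += (w - w_avg) / t
        iterates.append(w)
    return w, w_avg, np.array(iterates)

# Hypothetical toy problem: f(w) = w**2 / 2, so grad f(w) = w.
last_iterate, averaged_iterate, iterates = averaged_sgd(
    grad=lambda w: w, w0=3.0, step_size=0.1, noise_std=1.0, n_steps=5000
)
print("last iterate:    ", last_iterate)
print("averaged iterate:", averaged_iterate)
```

With a constant step size the last iterate keeps fluctuating around the minimum, while the averaged iterate cancels much of that noise. The paper's analysis concerns non-convex objectives with sharp and flat minima; the quadratic here only illustrates the averaging mechanics.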
Abstract
It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a solution with good generalization performance; such implicit bias is often characterized in terms of the sharpness of the minima. Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD which eliminates sharp local minima by the convolution using the stochastic gradient noise. We follow this line of research and study the commonly-used averaged SGD algorithm, which has been empirically observed in Izmailov et al. (2018) to prefer a flat minimum and therefore achieves better generalization. We prove that in certain problem settings, averaged SGD can efficiently optimize the smoothed objective which avoids sharp local minima. In experiments, we verify our theory and show that parameter averaging with an appropriate step size indeed leads to significant improvement in the…
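The smoothing effect described in the abstract, where convolving the objective with the gradient noise removes sharp local minima, can be illustrated on a one-dimensional toy problem. The sketch below rests on several assumptions: the objective (a narrow well at w = -2 and a wide well at w = +2), the Gaussian noise, and the value `noise_std=0.5` are hypothetical choices, and the expectation E[f(w - ε)] is approximated by a simple Riemann sum.

```python
import numpy as np

def objective(w):
    """Toy objective: a deep, narrow (sharp) well at w = -2 and a
    slightly shallower, wide (flat) well at w = +2."""
    sharp = -1.2 * np.exp(-(w + 2.0) ** 2 / (2 * 0.1 ** 2))
    flat = -1.0 * np.exp(-(w - 2.0) ** 2 / (2 * 1.0 ** 2))
    return sharp + flat

def smoothed_objective(w, noise_std, n_grid=2001):
    """Approximate E[f(w - eps)] for eps ~ N(0, noise_std**2) with a
    Riemann sum over eps in [-6 * noise_std, 6 * noise_std]."""
    eps = np.linspace(-6 * noise_std, 6 * noise_std, n_grid)
    pdf = np.exp(-eps ** 2 / (2 * noise_std ** 2)) / (noise_std * np.sqrt(2 * np.pi))
    d_eps = eps[1] - eps[0]
    w = np.asarray(w, dtype=float)
    # Broadcast over the evaluation points (rows) and noise grid (columns).
    return np.sum(objective(w[..., None] - eps) * pdf, axis=-1) * d_eps

grid = np.linspace(-4.0, 4.0, 801)
argmin_original = grid[np.argmin(objective(grid))]
argmin_smoothed = grid[np.argmin(smoothed_objective(grid, noise_std=0.5))]
print("minimizer of original objective:", argmin_original)
print("minimizer of smoothed objective:", argmin_smoothed)
```

In this toy setting the sharp well is the global minimum of the original objective, but after smoothing it is almost filled in and the flat well becomes the global minimum.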
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
