Decentralized SGD and Average-direction SAM are Asymptotically Equivalent
Tongtian Zhu, Fengxiang He, Kaixuan Chen, Mingli Song, Dacheng Tao

TL;DR
This paper demonstrates that decentralized stochastic gradient descent (D-SGD) asymptotically behaves like an average-direction Sharpness-aware Minimization (SAM) algorithm, revealing benefits for generalization and posterior estimation in decentralized learning.
Contribution
The paper proves the asymptotic equivalence between D-SGD and average-direction SAM, offering new insight into the benefits of decentralization and its regularization effects.
Findings
D-SGD implicitly minimizes an average-direction SAM loss.
Decentralization offers a free uncertainty evaluation mechanism.
Sharpness regularization in D-SGD does not diminish with larger batch sizes.
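To make the first finding more concrete, here is a minimal sketch of what an "average-direction" sharpness-aware loss can look like: the loss averaged over random perturbations of the weights, rather than evaluated at a single worst-case perturbation. Everything in it is an assumption for illustration and not taken from the paper: the Gaussian perturbation, the covariance `cov`, the toy quadratic `quad_loss`, and the helper name `avg_direction_loss`. For a quadratic loss with Hessian H, the averaged loss is exactly L(w) + 0.5 * Tr(H * cov), so the Monte Carlo estimate can be compared against that closed form.

```python
import numpy as np

def avg_direction_loss(loss_fn, w, cov, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_{eps ~ N(0, cov)}[loss_fn(w + eps)].

    Hypothetical helper: the Gaussian perturbation is an illustrative
    assumption, not the paper's exact definition.
    """
    rng = np.random.default_rng(seed)
    eps = rng.multivariate_normal(np.zeros(len(w)), cov, size=n_samples)
    return float(np.mean([loss_fn(w + e) for e in eps]))

# Toy quadratic loss with one sharp and one flat direction (assumed example).
H = np.diag([10.0, 0.1])

def quad_loss(w):
    return 0.5 * w @ H @ w

w = np.array([1.0, 1.0])
cov = 0.01 * np.eye(2)  # assumed perturbation covariance

smoothed = avg_direction_loss(quad_loss, w, cov)
# For a quadratic, the averaged loss adds a curvature penalty 0.5 * Tr(H cov).
closed_form = quad_loss(w) + 0.5 * np.trace(H @ cov)

print(f"plain loss:      {quad_loss(w):.4f}")
print(f"averaged (MC):   {smoothed:.4f}")
print(f"averaged (exact): {closed_form:.4f}")
```

The extra term grows with the curvature along the perturbed directions, which is the sense in which an averaged loss of this kind penalizes sharp minima.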
Abstract
Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge the conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction Sharpness-aware minimization (SAM) algorithm under general non-convex non-β-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) there exists a free uncertainty evaluation mechanism in D-SGD to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of…
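For readers unfamiliar with D-SGD, the following is a minimal simulation sketch of one common form of the update: each node averages its neighbours' weights through a doubly stochastic mixing matrix and then takes a local stochastic gradient step. The ring topology, the toy quadratic objective, the noise level, and the names `ring_mixing_matrix`, `dsgd`, `noisy_grad`, and `Xi` are all assumptions made for illustration; none of them comes from the abstract. The empirical covariance of the node weights around their average is computed at the end only as an illustrative stand-in for the weight spread that decentralization induces.

```python
import numpy as np

def ring_mixing_matrix(m):
    """Doubly stochastic gossip matrix for a ring of m >= 3 nodes:
    each node averages itself and its two neighbours with weight 1/3."""
    P = np.zeros((m, m))
    for j in range(m):
        P[j, j] = 1.0 / 3.0
        P[j, (j - 1) % m] = 1.0 / 3.0
        P[j, (j + 1) % m] = 1.0 / 3.0
    return P

def dsgd(grad_fn, w0, P, lr, steps, rng):
    """Run D-SGD: W <- P @ W - lr * local stochastic gradients.

    Row j of W holds the weights of node j. Hypothetical helper.
    """
    m = P.shape[0]
    W = np.tile(w0, (m, 1))
    for _ in range(steps):
        G = np.stack([grad_fn(W[j], rng) for j in range(m)])
        W = P @ W - lr * G
    return W

# Toy objective 0.5 * w^T H w with additive gradient noise (assumed example).
H = np.diag([1.0, 0.5])

def noisy_grad(w, rng, sigma=0.1):
    return H @ w + sigma * rng.standard_normal(w.shape)

rng = np.random.default_rng(0)
P = ring_mixing_matrix(16)
W = dsgd(noisy_grad, np.array([2.0, -1.0]), P, lr=0.05, steps=500, rng=rng)

w_avg = W.mean(axis=0)        # averaged model across nodes
D = W - w_avg                 # per-node deviation from the average
Xi = D.T @ D / W.shape[0]     # empirical covariance of the node weights

print("averaged model:", w_avg)
print("node-weight covariance:\n", Xi)
```

In this sketch the nodes never agree exactly: independent gradient noise keeps their weights scattered around the average, so each local gradient is evaluated at a slightly perturbed copy of the averaged model.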
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Age of Information Optimization · Privacy-Preserving Technologies in Data
Methods: Sharpness-Aware Minimization · Stochastic Gradient Descent
