Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models: Extension
Yunfei Teng, Wenbo Gao, Francois Chalus, Anna Choromanska, Donald Goldfarb, Adrian Weller

TL;DR
This paper introduces a novel distributed training algorithm for deep learning that uses a leader-based gradient update, improving communication efficiency and convergence behavior over traditional methods.
Contribution
It proposes Leader Gradient Descent (LGD) and its stochastic and multi-leader variants, enhancing convergence, communication efficiency, and robustness in distributed deep learning training.
Findings
Outperforms state-of-the-art baselines in CNN training.
Reduces communication overhead by broadcasting only leader parameters.
Breaks the curse of symmetry, i.e. being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes.
Abstract
We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader). Our method differs from the parameter-averaging scheme EASGD in a number of ways: (i) our objective formulation does not change the location of stationary points compared to the original optimization problem; (ii) we avoid convergence decelerations caused by pulling local workers descending to different local minima to each other (i.e. to the average of their parameters); (iii) our update by design breaks the curse of symmetry (the phenomenon of being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes); and (iv) our approach is more communication efficient since it…
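The two forces described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy objective `loss`, the helper `leader_step`, and the parameters `lr` and `pull` are all hypothetical names chosen for illustration, and the update form (gradient step plus a pull toward the current lowest-loss worker) is an assumption based only on the abstract's wording.

```python
import numpy as np

def loss(x):
    # Hypothetical non-convex toy objective with symmetric minima at +/-1 per coordinate.
    return float(np.sum((x**2 - 1.0) ** 2))

def grad(x):
    # Analytic gradient of the toy objective above.
    return 4.0 * x * (x**2 - 1.0)

def leader_step(workers, lr=0.05, pull=0.1):
    # Assumed update: each worker takes a regular gradient step plus a
    # corrective step toward the current best-performing worker (the leader).
    losses = [loss(w) for w in workers]
    leader = workers[int(np.argmin(losses))].copy()
    # Only the leader's parameters are shared with the other workers.
    # The leader's own pull term is zero, so it performs plain gradient descent.
    return [w - lr * grad(w) - lr * pull * (w - leader) for w in workers]

rng = np.random.default_rng(0)
workers = [rng.normal(size=2) for _ in range(4)]
initial_best = min(loss(w) for w in workers)

for _ in range(200):
    workers = leader_step(workers)

final_best = min(loss(w) for w in workers)
print(f"best loss: {initial_best:.4f} -> {final_best:.6f}")
```

Because workers are pulled toward one existing worker rather than toward the average of all workers, the corrective term does not drag them to a point lying between distinct minima, which is the contrast with parameter averaging that the abstract draws.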
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Advanced Memory and Neural Computing
