DC-S3GD: Delay-Compensated Stale-Synchronous SGD for Large-Scale Decentralized Neural Network Training
Alessandro Rigazzi

TL;DR
This paper introduces DC-S3GD, a decentralized (no Parameter Server) stale-synchronous SGD method that overlaps computation with communication, compensates for the resulting gradient error with a first-order correction, and reports state-of-the-art results when training convolutional neural networks with large batches.
Contribution
The paper presents a decentralized, stale-synchronous variant of the DC-ASGD algorithm that needs no Parameter Server and applies a first-order correction to the gradients to compensate for the error introduced by overlapping computation and communication.
Findings
Achieved state-of-the-art results on CNN training with large batches.
Demonstrated effective overlap of computation and communication.
Validated the approach's effectiveness through theoretical analysis and experiments.
Abstract
Data parallelism has become the de facto standard for training Deep Neural Networks on multiple processing units. In this work we propose DC-S3GD, a decentralized (without Parameter Server) stale-synchronous version of the Delay-Compensated Asynchronous Stochastic Gradient Descent (DC-ASGD) algorithm. In our approach, we allow for the overlap of computation and communication, and compensate the inherent error with a first-order correction of the gradients. We prove the effectiveness of our approach by training Convolutional Neural Networks with large batches and achieving state-of-the-art results.
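To make the "first-order correction of the gradients" concrete, here is a minimal NumPy sketch. The abstract does not state the exact update rule, so this assumes the DC-ASGD-style compensation, in which the Hessian is approximated by the element-wise square of the gradient scaled by a factor; the function name `delay_compensated_gradient` and the parameter `lam` are hypothetical and chosen only for illustration.

```python
import numpy as np


def delay_compensated_gradient(grad, w_stale, w_current, lam=0.04):
    """Correct a gradient that was computed at stale weights.

    Assumption (not taken from the abstract): a DC-ASGD-style first-order
    Taylor correction, with the Hessian approximated by lam * grad * grad
    (element-wise), so that
        g_corrected = g + lam * g * g * (w_current - w_stale).
    """
    grad = np.asarray(grad, dtype=float)
    drift = np.asarray(w_current, dtype=float) - np.asarray(w_stale, dtype=float)
    return grad + lam * grad * grad * drift


# Gradient evaluated at w_stale, while the weights have since moved to w_current.
grad = np.array([1.0, 2.0])
w_stale = np.array([0.0, 0.0])
w_current = np.array([0.5, -0.5])

corrected = delay_compensated_gradient(grad, w_stale, w_current, lam=0.1)
print(corrected)
```

When the weights have not moved (`w_current == w_stale`) the correction term vanishes and the original gradient is returned unchanged, which is the behaviour one would expect from any delay-compensation scheme.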
