Accelerating Parallel Stochastic Gradient Descent via Non-blocking Mini-batches
Haoze He, Parijat Dube

TL;DR
This paper introduces Non-blocking SGD, a decentralized training method that splits batches into mini-batches to reduce straggler delays, accelerating convergence in heterogeneous environments.
Contribution
It proposes a novel Non-blocking SGD approach that improves synchronization efficiency and convergence speed by splitting batches and using gradient accumulation in decentralized settings.
Findings
Speeds up training by up to 2x compared to state-of-the-art methods.
Effectively mitigates straggler delays in heterogeneous environments.
Maintains convergence guarantees with gradient accumulation.
Abstract
SOTA decentralized SGD algorithms can overcome the bandwidth bottleneck at the parameter server by using communication collectives such as Ring All-Reduce for synchronization. While the parameter updates in distributed SGD may happen asynchronously, there is still a synchronization barrier to ensure that the local training epoch at every learner is complete before the learners can advance to the next epoch. The delay incurred while waiting for the slowest learners (stragglers) remains a problem in the synchronization steps of these state-of-the-art decentralized frameworks. In this paper, we propose the (de)centralized Non-blocking SGD (Non-blocking SGD), which can address the straggler problem in a heterogeneous environment. The main idea of Non-blocking SGD is to split the original batch into mini-batches, then accumulate the gradients and update the model based on finished mini-batches. The…
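The main idea stated in the abstract (split each batch into mini-batches, accumulate gradients, update from the mini-batches that finished) can be illustrated with a toy sketch. This is not the authors' implementation. It rests on several assumptions that the abstract does not confirm: the hypothetical `finished` list stands in for a real deadline or timeout, gradients are averaged over the number of finished samples, and a plain in-process sum replaces a collective such as Ring All-Reduce. All function names are illustrative.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair for the toy model y = w * x


def split_into_minibatches(batch: Sequence[Sample], size: int) -> List[List[Sample]]:
    """Split one learner's batch into consecutive mini-batches of at most `size` samples."""
    return [list(batch[i:i + size]) for i in range(0, len(batch), size)]


def minibatch_gradient_sum(w: float, minibatch: Sequence[Sample]) -> float:
    """Sum over the mini-batch of d/dw [0.5 * (w * x - y) ** 2]."""
    return sum((w * x - y) * x for x, y in minibatch)


def non_blocking_step(
    w: float,
    batches: Sequence[Sequence[Sample]],
    finished: Sequence[int],
    minibatch_size: int,
    lr: float,
) -> float:
    """One update that uses only the mini-batches each learner finished.

    `finished[i]` is the (simulated) number of mini-batches learner i completed
    before the synchronization point. Assumption: the accumulated gradient is
    averaged over all finished samples across learners.
    """
    total_grad = 0.0
    total_samples = 0
    for batch, n_done in zip(batches, finished):
        minibatches = split_into_minibatches(batch, minibatch_size)
        for mb in minibatches[:n_done]:  # unfinished mini-batches are skipped
            total_grad += minibatch_gradient_sum(w, mb)
            total_samples += len(mb)
    if total_samples == 0:
        return w  # nobody finished anything: leave the model unchanged
    return w - lr * total_grad / total_samples


# Toy data generated from y = 2 * x; two learners hold identical batches.
batch = [(k / 4.0, 2.0 * k / 4.0) for k in range(1, 9)]
batches = [batch, batch]

w = 0.0
for _ in range(50):
    # Learner 0 finishes all 4 mini-batches; learner 1 (the straggler) only 1.
    w = non_blocking_step(w, batches, finished=[4, 1], minibatch_size=2, lr=0.5)
print(w)
```

In this sketch the straggler still contributes the gradient of the one mini-batch it completed, and the fast learner never waits for the remaining three.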
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Memory and Neural Computing · Distributed Control Multi-Agent Systems
Methods: Stochastic Gradient Descent
