Straggler Tolerant and Resilient DL Training on Homogeneous GPUs
Zeyu Zhang, Haiying Shen

TL;DR
This paper investigates the causes and impact of stragglers in homogeneous GPU-based deep learning training and introduces STAR, a system that improves training efficiency and resilience through adaptive synchronization and resource management.
Contribution
The paper identifies CPU and bandwidth usage imbalances as key causes of stragglers in homogeneous GPU training and proposes STAR, a system that uses adaptive synchronization modes and resource reallocation to reduce Time-To-Accuracy (TTA).
Findings
STAR reduces TTA by up to 84% compared to state-of-the-art approaches.
STAR maintains convergence accuracy while improving training speed.
Proactive resource management prevents overloading and reduces stragglers.
Abstract
Despite the popularity of homogeneous GPU-based deep learning (DL) training, the prevalence, causes and impact of stragglers and the effectiveness of existing straggler mitigation approaches are still not well understood in this scenario due to limited research on these questions. To fill this gap, we conducted comprehensive experiments and found that stragglers remain widespread due to CPU and bandwidth usage imbalances. Additionally, existing mitigation methods that switch from synchronous stochastic gradient descent (SSGD) to asynchronous SGD (ASGD) may not improve Time-To-Accuracy (TTA) and can even generate more stragglers due to ASGD's higher resource consumption. To address these newly found problems, we propose the Straggler Tolerant And Resilient DL training system (STAR). STAR includes new synchronization modes that group workers for each parameter update. It has a heuristic…
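To make the straggler problem concrete, the sketch below compares per-iteration wait time under SSGD, where every worker must report before the parameter update, with a hypothetical grouped mode in which the update proceeds once the fastest workers have reported. This is only an illustration of why grouping workers can mask a straggler; the function names, the `group_size` parameter, and the sample timings are assumptions and are not taken from STAR's actual design.

```python
def ssgd_iteration_time(worker_times):
    """Synchronous SGD: the update waits for all workers,
    so the slowest worker (the straggler) sets the pace."""
    return max(worker_times)


def grouped_iteration_time(worker_times, group_size):
    """Hypothetical grouped mode (an assumption, not STAR's algorithm):
    the update proceeds once the fastest `group_size` workers report,
    so the wait equals the group_size-th smallest worker time."""
    if not 1 <= group_size <= len(worker_times):
        raise ValueError("group_size must be between 1 and the number of workers")
    return sorted(worker_times)[group_size - 1]


# Illustrative per-worker compute times in seconds; the last worker is a straggler.
times = [1.0, 1.1, 0.9, 3.5]

print(ssgd_iteration_time(times))        # 3.5
print(grouped_iteration_time(times, 3))  # 1.1
```

A shorter wait per iteration does not by itself imply a lower TTA: updates computed from fewer or staler gradients may need more iterations to converge, which is the trade-off the abstract points to when it notes that switching to ASGD may not improve TTA.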
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · IoT and Edge/Fog Computing
