Training LLMs with Fault Tolerant HSDP on 100,000 GPUs
Omkar Salpekar, Rohan Varma, Kenny Yu, Vladimir Ivanov, Yang Wang, Ahmed Sharif, Min Si, Shawn Xu, Feng Tian, Shengbao Zheng, Tristan Rice, Ankush Garg, Shangfu Peng, Shreyas Siravara, Wenyin Fu, Rodrigo de Castro, Adithya Gangidi, Andrey Obraztsov, Sharan Narang, Sergey Edunov

TL;DR
This paper introduces FT-HSDP, a fault-tolerant training paradigm for large-scale GPU clusters, significantly improving efficiency and recovery time during failures without sacrificing model accuracy.
Contribution
The paper proposes FT-HSDP, a novel fault-tolerant data parallel training method with new protocols for gradient exchange and replica recovery at scale.
Findings
Reduces failure recovery stall time from 10 to 3 minutes.
Increases effective training time from 44% to 80%.
Maintains model accuracy despite asynchronous recovery.
Abstract
Large-scale training systems typically use synchronous training, requiring all GPUs to be healthy simultaneously. In our experience training on O(100K) GPUs, synchronous training results in a low efficiency due to frequent failures and long recovery time. To address this problem, we propose a novel training paradigm, Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP). FT-HSDP uses data parallel replicas as units of fault tolerance. When failures occur, only a single data-parallel replica containing the failed GPU or server is taken offline and restarted, while the other replicas continue training. To realize this idea at scale, FT-HSDP incorporates several techniques: 1) We introduce a Fault Tolerant All Reduce (FTAR) protocol for gradient exchange across data parallel replicas. FTAR relies on the CPU to drive the complex control logic for tasks like adding or removing…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDistributed systems and fault tolerance · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques
