Training LLMs with Fault Tolerant HSDP on 100,000 GPUs

Omkar Salpekar; Rohan Varma; Kenny Yu; Vladimir Ivanov; Yang Wang; Ahmed Sharif; Min Si; Shawn Xu; Feng Tian; Shengbao Zheng; Tristan Rice; Ankush Garg; Shangfu Peng; Shreyas Siravara; Wenyin Fu; Rodrigo de Castro; Adithya Gangidi; Andrey Obraztsov; Sharan Narang; Sergey Edunov; Maxim Naumov; Chunqiang Tang; Mathew Oldham

arXiv:2602.00277·cs.DC·February 3, 2026

Open Access

TL;DR

This paper introduces FT-HSDP, a fault-tolerant training paradigm for large-scale GPU clusters, significantly improving efficiency and recovery time during failures without sacrificing model accuracy.

Contribution

The paper proposes FT-HSDP, a novel fault-tolerant data parallel training method with new protocols for gradient exchange and replica recovery at scale.

Findings

01 Reduces failure recovery stall time from 10 minutes to 3 minutes.

02 Increases effective training time from 44% to 80%.

03 Maintains model accuracy despite asynchronous recovery.

Abstract

Large-scale training systems typically use synchronous training, which requires all GPUs to be healthy simultaneously. In our experience training on O(100K) GPUs, synchronous training results in low efficiency due to frequent failures and long recovery times. To address this problem, we propose a novel training paradigm, Fault Tolerant Hybrid-Sharded Data Parallelism (FT-HSDP). FT-HSDP uses data-parallel replicas as units of fault tolerance. When a failure occurs, only the single data-parallel replica containing the failed GPU or server is taken offline and restarted, while the other replicas continue training. To realize this idea at scale, FT-HSDP incorporates several techniques: 1) We introduce a Fault Tolerant All Reduce (FTAR) protocol for gradient exchange across data-parallel replicas. FTAR relies on the CPU to drive the complex control logic for tasks like adding or removing…
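The replica-level fault tolerance described in the abstract can be illustrated with a toy, single-process sketch. This is not the paper's FTAR protocol: the names `Replica`, `fault_tolerant_all_reduce`, `train_step`, and `recover` are hypothetical, membership is modelled with a plain `healthy` flag, and recovery by copying weights from a healthy peer is an assumption made only for illustration. The sketch shows just the core idea that gradients are averaged over whichever replicas are currently healthy, so one failed replica does not stall the others.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One data-parallel replica, treated as the unit of fault tolerance."""
    rank: int
    weights: list
    healthy: bool = True


def fault_tolerant_all_reduce(replicas, grads):
    """Average gradients over the currently healthy replicas only.

    `grads` maps replica rank -> gradient vector (hypothetical layout).
    """
    live = [r.rank for r in replicas if r.healthy]
    if not live:
        raise RuntimeError("no healthy replicas left")
    dim = len(grads[live[0]])
    return [sum(grads[rank][i] for rank in live) / len(live) for i in range(dim)]


def train_step(replicas, grads, lr=0.5):
    """Apply one SGD step on healthy replicas; offline replicas are skipped."""
    avg = fault_tolerant_all_reduce(replicas, grads)
    for r in replicas:
        if r.healthy:
            r.weights = [w - lr * g for w, g in zip(r.weights, avg)]


def recover(replica, peer):
    """Bring a restarted replica back by copying state from a healthy peer.

    Assumption for illustration; the source does not describe this step here.
    """
    replica.weights = list(peer.weights)
    replica.healthy = True


if __name__ == "__main__":
    replicas = [Replica(rank, [1.0, 1.0]) for rank in range(3)]
    grads = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0]}
    replicas[1].healthy = False          # replica 1 fails and is taken offline
    train_step(replicas, grads)          # replicas 0 and 2 keep training
    recover(replicas[1], replicas[0])    # replica 1 rejoins with current weights
    print([r.weights for r in replicas])
```

In this sketch the healthy replicas never wait for the failed one; the cost is that a step taken during an outage averages over fewer gradients, which is the kind of trade-off the paper's accuracy finding addresses.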

Taxonomy

Topics: Distributed systems and fault tolerance · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques