ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
Tenghui Ma, Jihu Guo, Wei Gao, Sitian Lu, Zhisheng Ye, Hanjing Wang, and Dahua Lin

TL;DR
ResiHP is a system that makes large-scale LLM training more resilient by accurately detecting failures and dynamically adapting the hybrid parallelism strategy, improving training throughput by 1.04-4.39× under failure scenarios.
Contribution
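ResiHP introduces a workload-aware failure detector and a dynamic scheduler for hybrid parallelism. Together they address two gaps in existing resilient training systems: spurious fail-slow detections caused by sequence length variability, and inefficient failure mitigation under device performance skew.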
Findings
ResiHP improves training throughput by 1.04-4.39× under failure scenarios.
Its detector distinguishes real failures from iteration time fluctuations caused by sequence length variability.
ResiHP outperforms existing resilient training systems in diverse failure conditions.
Abstract
Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At such scale, hardware failures on individual devices lead to performance skew across devices, diminishing overall training efficiency. Existing resilient systems overlook sequence length variability in datasets and device performance skew under hybrid parallelism. As a result, (1) iteration time fluctuations induced by sequence length variability can trigger spurious fail-slow detections, and (2) failures are mitigated through individual adaptations in hybrid parallelism, leading to unnecessary detection overhead and inefficient resilient training. To respond, this paper presents ResiHP, a resilient system that enables robust failure detection and fine-grained adaptation for hybrid parallel training. First, we develop a Detector to accurately identify failures. In particular, it employs a…
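The spurious-detection problem described in point (1) can be illustrated with a minimal sketch. This is not ResiHP's Detector, whose design is cut off in the abstract above. It is only a toy comparison under stated assumptions: the function names, the 1.3× threshold, and the measurements are all hypothetical, and iteration time is assumed to scale linearly with token count, which is a simplification for real transformer workloads.

```python
from statistics import median

def naive_slow_flags(iter_times, threshold=1.3):
    """Flag iterations whose raw time exceeds threshold x the median time."""
    base = median(iter_times)
    return [t > threshold * base for t in iter_times]

def workload_aware_slow_flags(iter_times, token_counts, threshold=1.3):
    """Flag iterations whose time per token exceeds threshold x the median.

    Assumption (hypothetical): time grows linearly with the number of tokens,
    so dividing by token count removes the effect of sequence length.
    """
    per_token = [t / n for t, n in zip(iter_times, token_counts)]
    base = median(per_token)
    return [r > threshold * base for r in per_token]

# Hypothetical measurements: iteration 2 processes twice as many tokens
# (a long batch, not a failure); iteration 4 is slow at a normal token count.
token_counts = [4096, 4096, 8192, 4096, 4096]
iter_times = [1.0, 1.0, 2.0, 1.0, 1.6]

naive = naive_slow_flags(iter_times)
aware = workload_aware_slow_flags(iter_times, token_counts)
```

In this toy data the raw-time check flags both the long batch and the slow iteration, while the per-token check flags only the slow iteration, which is the kind of distinction a workload-aware detector needs to make.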
