ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

Tenghui Ma, Jihu Guo, Wei Gao, Sitian Lu, Zhisheng Ye, Hanjing Wang, and Dahua Lin

arXiv:2605.06374 · cs.DC · May 12, 2026

TL;DR

ResiHP makes large-scale LLM training more resilient by detecting failures accurately and adapting the hybrid parallelism strategy dynamically, improving training throughput by 1.04× to 4.39× under GPU failures.

Contribution

ResiHP introduces a workload-aware failure detector and a dynamic scheduler for hybrid parallelism, addressing spurious failure detections caused by sequence length variability and inefficient failure mitigation in large-scale training.

Findings

01. ResiHP improves training throughput by 1.04× to 4.39× under failure scenarios.

02. The system effectively distinguishes failures from iteration time fluctuations.

03. ResiHP outperforms existing resilient training systems in diverse failure conditions.

Abstract

Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At such scale, hardware failures on individual devices lead to performance skew across devices, diminishing overall training efficiency. Existing resilient systems overlook sequence length variability in datasets and device performance skew under hybrid parallelism. As a result, (1) iteration time fluctuations induced by sequence length variability can trigger spurious fail-slow detections, and (2) failures are mitigated through individual adaptations in hybrid parallelism, leading to unnecessary detection overhead and inefficient resilient training. In response, this paper presents ResiHP, a resilient system that enables robust failure detection and fine-grained adaptation for hybrid parallel training. First, we develop a Detector to accurately identify failures. In particular, it employs a…
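Problem (1) in the abstract can be illustrated with a toy example. The sketch below is not the paper's Detector, whose design is cut off above. It only shows why a threshold on raw iteration time can misfire when sequence lengths vary, and how normalizing by workload avoids that. The function names, the `slack` threshold, the synthetic trace, and the assumption that iteration time scales linearly with token count are all hypothetical simplifications.

```python
from statistics import median

def naive_flags(iter_times, slack=1.5):
    """Flag iterations whose raw time exceeds slack x the median iteration time."""
    baseline = median(iter_times)
    return [t > slack * baseline for t in iter_times]

def workload_aware_flags(iter_times, token_counts, slack=1.5):
    """Flag iterations whose time per token exceeds slack x the median time per token.

    Assumption (hypothetical): iteration time grows linearly with token count.
    Real cost models are more involved, e.g. attention is superlinear in length.
    """
    per_token = [t / n for t, n in zip(iter_times, token_counts)]
    baseline = median(per_token)
    return [r > slack * baseline for r in per_token]

# Synthetic trace (made up for illustration, not measured data).
# Iteration 2 processes an unusually long batch; iteration 4 is genuinely slow.
token_counts = [4000, 4200, 9000, 3900, 4100, 4000]
iter_times = [1.00, 1.05, 2.25, 0.98, 2.05, 1.00]  # seconds

naive = naive_flags(iter_times)
aware = workload_aware_flags(iter_times, token_counts)

print("naive flagged iterations:         ", [i for i, f in enumerate(naive) if f])
print("workload-aware flagged iterations:", [i for i, f in enumerate(aware) if f])
```

On this trace the raw-time check flags both iteration 2 (a long batch on healthy hardware) and iteration 4, while the per-token check flags only iteration 4.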
