FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training
Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang

TL;DR
This paper introduces FALCON, a framework for quickly detecting and mitigating stragglers in large-scale hybrid-parallel GPU training, significantly reducing training slowdowns without human intervention.
Contribution
FALCON provides a novel, automated approach to identify and mitigate GPU and network fail-slows in large-scale training environments, improving efficiency.
Findings
FALCON detects fail-slows with over 99% accuracy.
It mitigates training slowdown by 60.1%.
It addresses the CPU/GPU computation and cross-node networking issues that cause delays.
Abstract
Fail-slows, or stragglers, are common but largely unheeded problems in large-scale hybrid-parallel training that spans thousands of GPU servers and runs for weeks to months. Yet, these problems are not well studied, nor can they be quickly detected and effectively mitigated. In this paper, we first present a characterization study on a shared production cluster with over 10,000 GPUs. We find that fail-slows are caused by various CPU/GPU computation and cross-node networking issues, lasting from tens of seconds to nearly ten hours, and collectively delaying the average job completion time by 1.34%. The current practice is to manually detect these fail-slows and simply treat them as fail-stops using a checkpoint-and-restart failover approach, which is labor-intensive and time-consuming. In this paper, we propose FALCON, a framework that rapidly identifies fail-slowed GPUs and/or…
Taxonomy
Topics: Real-Time Systems Scheduling · Real-time simulation and control systems · Parallel Computing and Optimization Techniques
