TrainMover: An Interruption-Resilient Runtime for ML Training
ChonLam Lao, Jiaqi Gao, Jiamin Cao, Zhipeng Zhang, Pengcheng Zhang, Jiangfei Duan, Zhilong Zheng, Yu Guan, Yichi Xu, Yong Li, Zhengping Qian, Aditya Akella, Minlan Yu, Ennan Zhai, Dennis Cai, and Jingren Zhou

TL;DR
TrainMover is a runtime for large-scale ML training that uses elastic and standby machines to minimize downtime and wasted resources when a job is interrupted.
Contribution
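It introduces a resilient training runtime built on three techniques: two-phase, delta-based communication group setup; communication-free sandboxed warmup; and a general standby design that enables failure recovery from any role. Together these reduce downtime and wasted GPU-hours at scale.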
Findings
Achieves around 20 seconds of downtime during interruptions at 1024-GPU scale.
Projects a 55% reduction in wasted GPU hours compared to the best alternative, saving 1.4 million GPU-hours weekly at 64K-GPU scale.
Demonstrates effective failure recovery with minimal performance degradation.
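The projected savings can be put in context with a back-of-envelope calculation. The sketch below is not from the paper: it assumes "64K" means 65,536 GPUs running around the clock, and it treats the quoted 55% and 1.4 million figures as exact even though they are rounded, so the derived numbers are approximate. The function name `projected_waste` is hypothetical.

```python
def projected_waste(gpus: int, saved_gpu_hours: float, reduction: float) -> dict:
    """Derive the weekly waste implied by a quoted saving and a relative reduction."""
    hours_per_week = 7 * 24
    total = gpus * hours_per_week                 # GPU-hours available per week
    baseline_waste = saved_gpu_hours / reduction  # waste under the best alternative
    remaining_waste = baseline_waste - saved_gpu_hours
    return {
        "total": total,
        "baseline_waste": baseline_waste,
        "remaining_waste": remaining_waste,
        "baseline_fraction": baseline_waste / total,
        "remaining_fraction": remaining_waste / total,
    }


# Assumption: 64K GPUs = 64 * 1024; 1.4e6 and 0.55 are the rounded figures quoted above.
result = projected_waste(gpus=64 * 1024, saved_gpu_hours=1.4e6, reduction=0.55)
for key, value in result.items():
    print(f"{key}: {value:,.3f}")
```

Under these assumptions, the quoted figures imply that the best alternative wastes roughly 2.5 million of about 11 million available GPU-hours per week (a bit under a quarter), and that the projected remaining waste is a little over 1.1 million GPU-hours.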
Abstract
Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpoint-restart or runtime reconfiguration suffer from long downtimes and degraded performance. We present TrainMover, a resilient LLM training runtime that leverages elastic and standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces three key techniques: two-phase, delta-based communication group setup; communication-free sandboxed warmup; and general standby design that enables failure recovery from any role. Our evaluation shows that TrainMover consistently achieves around 20 seconds of downtime when handling various interruptions at the 1024-GPU scale. TrainMover is projected to reduce wasted GPU hours by 55% compared to the best alternative, saving 1.4…
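The abstract names delta-based communication group setup but does not describe its mechanism. As a generic illustration of why a delta-oriented approach can be cheaper than rebuilding everything, the sketch below identifies which communication groups contain a rank on a replaced machine. This is not TrainMover's algorithm: the group layout, the four-GPU machine, and the names `build_groups` and `affected_groups` are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def build_groups(world_size: int, tp_size: int) -> Dict[str, List[int]]:
    """Hypothetical layout: contiguous tensor-parallel (tp) groups and
    strided data-parallel (dp) groups over ranks 0..world_size-1."""
    groups: Dict[str, List[int]] = {}
    for g in range(world_size // tp_size):
        groups[f"tp{g}"] = list(range(g * tp_size, (g + 1) * tp_size))
    for i in range(tp_size):
        groups[f"dp{i}"] = list(range(i, world_size, tp_size))
    return groups


def affected_groups(groups: Dict[str, List[int]], replaced: Set[int]) -> List[str]:
    """Return the names of groups that include at least one replaced rank."""
    return sorted(name for name, ranks in groups.items() if replaced & set(ranks))


groups = build_groups(world_size=32, tp_size=4)
replaced_ranks = {4, 5, 6, 7}  # assumption: one machine hosting four GPUs is swapped out
delta = affected_groups(groups, replaced_ranks)
untouched = sorted(set(groups) - set(delta))
print("groups needing changes:", delta)
print("groups left as-is:", untouched)
```

In this toy layout, only the groups that include a replaced rank would need any membership change; the remaining groups could in principle be left untouched, which is the intuition behind working with a delta rather than a full rebuild.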
