SWIFT: Expedited Failure Recovery for Large-scale DNN Training
Yuchen Zhong, Guangming Sheng, Juncheng Liu, Jinhui Yuan, and Chuan Wu

TL;DR
SWIFT is a novel failure recovery method for large-scale distributed deep neural network training that reduces recovery time and overhead by leveraging model replicas and intermediate data logging, without sacrificing training speed or accuracy.
Contribution
SWIFT introduces a logging-based failure recovery approach that avoids copying model states, significantly reducing recovery overhead and accelerating training in large-scale DNNs.
Findings
Reduces failure recovery time significantly
Achieves up to 1.16x training speedup
Maintains model accuracy during recovery
Abstract
As deep learning models grow larger, training takes more time and resources, making fault tolerance increasingly critical. Existing state-of-the-art methods such as CheckFreq and Elastic Horovod must keep a backup copy of the model state (i.e., parameters and optimizer states) in memory, which is costly for large models and adds non-trivial overhead. This paper presents SWIFT, a novel recovery design for distributed deep neural network training that significantly reduces failure recovery overhead without affecting training throughput or model accuracy. Instead of making an additional copy of the model state, SWIFT resolves the inconsistencies in the model state caused by a failure and exploits the replicas of the model state in data parallelism for failure recovery. When replicas are unavailable, we propose a logging-based approach, which records…
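The replica-based idea in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not SWIFT's implementation: the `Worker` class, `all_reduce_mean`, and `recover_from_replica` are hypothetical names, the gradients are a stand-in for a real backward pass, and the failure is assumed to occur between training steps so that the surviving replica is already consistent. It does not model the inconsistency resolution or the logging-based path that the abstract mentions.

```python
class Worker:
    """Toy data-parallel worker holding a full replica of the model state."""

    def __init__(self, params, lr=0.1, beta=0.9):
        self.params = list(params)             # model parameters
        self.momentum = [0.0] * len(params)    # optimizer state
        self.lr, self.beta = lr, beta
        self.step = 0

    def local_grad(self, rank):
        # Stand-in for a backward pass on this worker's data shard.
        return [(rank + 1) * p for p in self.params]

    def apply(self, avg_grad):
        # SGD with momentum, applied to the averaged gradient.
        for i, g in enumerate(avg_grad):
            self.momentum[i] = self.beta * self.momentum[i] + g
            self.params[i] -= self.lr * self.momentum[i]
        self.step += 1

    def snapshot(self):
        return {"params": list(self.params),
                "momentum": list(self.momentum),
                "step": self.step}

    def restore(self, state):
        self.params = list(state["params"])
        self.momentum = list(state["momentum"])
        self.step = state["step"]


def all_reduce_mean(grads):
    # Element-wise average across workers, mimicking an all-reduce.
    n = len(grads)
    return [sum(col) / n for col in zip(*grads)]


def train_step(workers):
    grads = [w.local_grad(rank) for rank, w in enumerate(workers)]
    avg = all_reduce_mean(grads)
    for w in workers:
        w.apply(avg)  # every replica applies the same update


def recover_from_replica(workers, failed_rank):
    # Hypothetical recovery: a replacement worker copies the state of any
    # surviving replica; no separate checkpoint copy is kept anywhere.
    survivor = next(w for r, w in enumerate(workers) if r != failed_rank)
    replacement = Worker([0.0] * len(survivor.params),
                         lr=survivor.lr, beta=survivor.beta)
    replacement.restore(survivor.snapshot())
    workers[failed_rank] = replacement


workers = [Worker([1.0, -2.0]) for _ in range(3)]
for _ in range(3):
    train_step(workers)

recover_from_replica(workers, failed_rank=1)   # worker 1 is lost and replaced

for _ in range(2):
    train_step(workers)

# Compare the recovered worker against an untouched replica.
print(workers[1].snapshot() == workers[0].snapshot())
```

Because every data-parallel worker applies the same averaged gradient, the replicas stay identical, so a surviving worker can act as the recovery source without an extra in-memory copy of the model state.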
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Age of Information Optimization
