SWIFT: Expedited Failure Recovery for Large-scale DNN Training

Yuchen Zhong, Guangming Sheng, Juncheng Liu, Jinhui Yuan, and Chuan Wu

arXiv:2302.06173 · cs.DC · August 26, 2024 · 1 citation

TL;DR

SWIFT is a novel failure recovery method for large-scale distributed deep neural network training that reduces recovery time and overhead by leveraging model replicas and intermediate data logging, without sacrificing training speed or accuracy.

Contribution

SWIFT introduces a logging-based failure recovery approach that avoids copying model states, significantly reducing recovery overhead and accelerating training in large-scale DNNs.

Findings

01 Reduces failure recovery time significantly

02 Achieves up to 1.16x training speedup

03 Maintains model accuracy during recovery

Abstract

As deep learning models grow larger, training takes more time and more resources, making fault tolerance increasingly critical. Existing state-of-the-art methods such as CheckFreq and Elastic Horovod need to back up a copy of the model state (i.e., parameters and optimizer states) in memory, which is costly for large models and leads to non-trivial overhead. This paper presents SWIFT, a novel recovery design for distributed deep neural network training that significantly reduces the failure recovery overhead without affecting training throughput or model accuracy. Instead of making an additional copy of the model state, SWIFT resolves the inconsistencies in the model state caused by the failure and exploits the replicas of the model state in data parallelism for failure recovery. We propose a logging-based approach when replicas are unavailable, which records…
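The replica-based idea in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the `Worker` class and `recover_from_replica` function are hypothetical names, the model state is reduced to a list of floats, and the failure is assumed to happen between iterations, so the surviving replicas are already consistent. The sketch does not cover the inconsistency resolution or the logging-based path that the abstract also mentions.

```python
class Worker:
    """One data-parallel worker holding a full replica of the model state (toy)."""

    def __init__(self, rank, params, step=0):
        self.rank = rank
        self.params = list(params)  # each worker keeps its own copy
        self.step = step

    def apply_update(self, avg_grad, lr=0.1):
        # Every worker applies the same averaged gradient, so replicas stay identical.
        self.params = [p - lr * g for p, g in zip(self.params, avg_grad)]
        self.step += 1


def recover_from_replica(failed_rank, workers):
    """Rebuild the failed worker's state from any surviving replica.

    Assumption: survivors are consistent (failure occurred between iterations).
    """
    survivor = next(w for w in workers if w.rank != failed_rank)
    return Worker(failed_rank, survivor.params, survivor.step)


# Three workers start from the same parameters and run two iterations.
workers = [Worker(rank, [1.0, -2.0]) for rank in range(3)]
for avg_grad in ([0.5, 0.5], [-1.0, 2.0]):
    for w in workers:
        w.apply_update(avg_grad)

# Worker 1 fails and loses its state; a replacement copies a survivor's state.
workers[1] = recover_from_replica(1, workers)
```

Because every data-parallel worker already holds the full model state, no separate in-memory backup is needed in this toy setting, which is the cost the abstract attributes to CheckFreq and Elastic Horovod.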

Code & Models

Repositories

jasperzhong/swift (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Age of Information Optimization