DaiFu: In-Situ Crash Recovery for Deep Learning Systems
Zilong He, Pengfei Chen, Hongyu Zhang, Xiaoyun Li, Guangba Yu, Hongyang Chen, Zibin Zheng

TL;DR
DaiFu is a lightweight in-situ crash recovery framework for deep learning systems that significantly reduces recovery time with minimal overhead, improving development efficiency and resource utilization.
Contribution
DaiFu introduces a novel in-situ recovery method using code transformation to enable instant crash recovery in deep learning systems.
Findings
Achieves 1372x faster recovery than existing solutions.
Overhead of DaiFu is less than 0.40%.
Effective across 7 different crash scenarios.
Abstract
Deep learning (DL) systems have been widely adopted in many areas, and are becoming even more popular with the emergence of large language models. However, due to the complex software stacks involved in their development and execution, crashes are unavoidable and common. Crashes severely waste computing resources and hinder development productivity, so efficient crash recovery is crucial. Existing solutions, such as checkpoint-retry, are too heavyweight for fast recovery from crashes caused by minor programming errors or transient runtime errors. Therefore, we present DaiFu, an in-situ recovery framework for DL systems. Through a lightweight code transformation to a given DL system, DaiFu augments it to intercept crashes in situ and enables dynamic and instant updates to its program running context (e.g., code, configurations, and other data) for agile crash recovery. Our evaluation…
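The in-situ idea described in the abstract can be illustrated with a minimal Python sketch. This is not DaiFu's actual implementation or API: the decorator `recover_in_situ`, the repair callback `halve_batch_size`, and the toy `train_step` are all hypothetical names chosen for illustration. The sketch assumes a crash surfaces as a Python exception and that the repair consists of updating one configuration value; the in-memory state survives the crash, so nothing is reloaded from a checkpoint.

```python
import functools


def recover_in_situ(repair, max_attempts=3):
    """Hypothetical decorator: intercept a crash and retry in the same process.

    `repair` receives the exception and a copy of the keyword arguments and
    returns the (possibly updated) keyword arguments for the next attempt.
    """
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_attempts:
                        raise  # give up and let the crash propagate
                    # Update the running context instead of restarting.
                    kwargs = repair(exc, dict(kwargs))
        return wrapper
    return decorate


def halve_batch_size(exc, kwargs):
    """Hypothetical repair: shrink the batch size after a failure."""
    kwargs["batch_size"] = max(1, kwargs["batch_size"] // 2)
    return kwargs


@recover_in_situ(halve_batch_size)
def train_step(state, batch_size):
    # Simulated transient failure: large batches do not fit in memory.
    if batch_size > 32:
        raise MemoryError("simulated out-of-memory")
    state["steps"] += 1
    return batch_size


# The state dictionary stands in for model weights held in memory.
state = {"steps": 0, "weights": [0.1, 0.2]}
used = train_step(state, batch_size=128)
```

Here the first two attempts fail and the third succeeds with the adjusted configuration, while `state` is never rebuilt. A checkpoint-retry approach would instead restart the process and reload the saved state before trying again.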
Taxonomy
Topics: Software Testing and Debugging Techniques · Security and Verification in Computing · Adversarial Robustness in Machine Learning
