AutoCheck: Automatically Identifying Variables for Checkpointing by Data Dependency Analysis
Xiang Fu (Nanchang Hangkong University), Weiping Zhang (Nanchang Hangkong University), Xin Huang (Nanchang Hangkong University), Wubiao Xu (Nanchang Hangkong University), Shiman Meng (Nanchang Hangkong University), Luanzheng Guo (Pacific Northwest National Laboratory)

TL;DR
AutoCheck is a tool that automatically identifies critical variables for checkpointing in HPC applications by analyzing data dependencies, simplifying the process for system engineers and scientists without requiring deep domain knowledge.
Contribution
It introduces an analytical model and heuristics to automatically determine variables crucial for checkpointing, improving efficiency and accessibility.
Findings
AutoCheck accurately identifies critical variables in 14 HPC benchmarks.
The tool reduces the time needed to pinpoint checkpoint variables to a few minutes.
AutoCheck demonstrates effectiveness across diverse HPC applications.
Abstract
Checkpoint/Restart (C/R) has been widely deployed in numerous HPC systems, clouds, and industrial data centers, which are typically operated by system engineers. Nevertheless, no existing approach helps system engineers without domain expertise, or domain scientists without system fault tolerance knowledge, identify the critical variables that must be checkpointed to correctly restore application execution after a failure. To address this problem, we propose an analytical model and a tool (AutoCheck) that can automatically identify critical variables to checkpoint for C/R. AutoCheck relies on, first, analytically tracking and optimizing data dependency between variables and other application execution state, and, second, a set of heuristics that identify critical variables for checkpointing from the refined data dependency graph (DDG). AutoCheck allows programmers to pinpoint…
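To make the idea of selecting checkpoint variables from data dependencies concrete, here is a minimal Python sketch. It does not reproduce AutoCheck's analytical model or its heuristics, which the abstract does not spell out. It assumes one simple, commonly used rule: a variable is a checkpoint candidate if it is defined before the main loop, read in an iteration before being written in that iteration, and also written somewhere in the loop. The statement format, the function name `checkpoint_candidates`, and the example variables are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical trace format: each statement is (variable_written, variables_read).
Stmt = Tuple[str, Set[str]]


def checkpoint_candidates(loop_body: List[Stmt],
                          defined_before_loop: Set[str]) -> Set[str]:
    """Return variables whose values carry over from one iteration to the next.

    Illustrative rule only (an assumption, not AutoCheck's heuristics):
    keep a variable if it is defined before the loop, read before it is
    written within one iteration, and written somewhere in the loop body.
    """
    written_so_far: Set[str] = set()
    read_before_write: Set[str] = set()
    for target, sources in loop_body:
        for src in sources:
            # A read with no earlier write in this iteration consumes a
            # value produced before the loop or by a previous iteration.
            if src not in written_so_far:
                read_before_write.add(src)
        written_so_far.add(target)
    return {
        v for v in read_before_write
        if v in written_so_far and v in defined_before_loop
    }


# Toy main loop body:
#   tmp  = f(x, dt)
#   x    = g(tmp, x)
#   step = step + 1
loop_body: List[Stmt] = [
    ("tmp", {"x", "dt"}),
    ("x", {"tmp", "x"}),
    ("step", {"step"}),
]
result = checkpoint_candidates(loop_body, defined_before_loop={"x", "dt", "step"})
print(sorted(result))  # ['step', 'x']
```

In this toy case `dt` is excluded because it is never modified in the loop and can be re-initialized on restart, and `tmp` is excluded because it is recomputed before use in every iteration. A real tool would derive these dependencies from the program itself rather than from a hand-written list.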
Taxonomy
Topics: Distributed systems and fault tolerance · Advanced Data Storage Technologies · Data Quality and Management
