Scrutinizing Variables for Checkpoint Using Automatic Differentiation
Xin Huang, Weiping Zhang, Shiman Meng, Wubiao Xu, Xiang Fu, Luanzheng Guo, Kento Sato

TL;DR
This paper introduces a systematic method that uses automatic differentiation (AD) to identify which elements of a variable are critical for checkpointing, reducing checkpoint storage by up to 20% in HPC applications.
Contribution
It presents an approach that uses AD to examine every element of each checkpointed variable, so that only elements affecting the application output need to be saved.
Findings
Up to 20% storage savings in checkpointing.
Effective visualization of critical/uncritical regions.
Critical/uncritical patterns align with the algorithm logic.
Abstract
Checkpoint/Restart (C/R) periodically saves the running state of programs, which consumes considerable system resources. We observe that not every piece of data is involved in the computation in typical HPC applications; such unused data should be excluded from checkpointing for better storage/compute efficiency. To identify such data, we propose a systematic approach that leverages automatic differentiation (AD) to scrutinize every element within variables (e.g., arrays) for checkpointing, allowing us to identify critical/uncritical elements and eliminate uncritical elements from checkpointing. Specifically, we inspect every single element within a variable for checkpointing with an AD tool to determine whether the element has an impact on the application output or not. We empirically validate our approach with eight benchmarks from the NAS Parallel Benchmark (NPB) suite. We successfully…
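The element-wise inspection described in the abstract can be illustrated with a minimal sketch. This is not the authors' tool or implementation: it uses a tiny hand-written reverse-mode AD in pure Python, and the names `Var`, `backward`, `critical_mask`, and `toy_kernel` are hypothetical. It assumes an element is treated as uncritical when the derivative of the output with respect to that element is zero. A zero derivative at one particular input value does not by itself prove the element is unused (for example, `x * x` has zero derivative at `x = 0`), so a real analysis would need to handle such cases.

```python
class Var:
    """Scalar node in a tiny reverse-mode AD graph (illustrative only)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every reachable node."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # post-order: parents appear before the node

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # output first, inputs last
        for parent, local in node.parents:
            parent.grad += node.grad * local


def critical_mask(compute, state):
    """Return True for each element whose derivative w.r.t. the output is nonzero."""
    xs = [Var(v) for v in state]
    backward(compute(xs))
    return [x.grad != 0.0 for x in xs]


def toy_kernel(xs):
    # Hypothetical computation: the first and last elements act as
    # padding and never contribute to the output.
    total = Var(0.0)
    for x in xs[1:-1]:
        total = total + x * x
    return total


state = [5.0, 1.0, 2.0, 3.0, 7.0]
mask = critical_mask(toy_kernel, state)

# Save only the critical elements, keyed by their index in the array.
checkpoint = {i: v for i, (v, keep) in enumerate(zip(state, mask)) if keep}
print(mask)
print(checkpoint)
```

In this toy case the two padding elements never enter the computation graph, so their gradients stay at zero and they are left out of the checkpoint. On restart, the omitted positions could be filled with any placeholder value without changing the output.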
Taxonomy
Topics: Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques · Logic, programming, and type systems
