Energy-efficient localised rollback after failures via data flow analysis
Kiril Dichev, Kirk Cameron, Dimitrios Nikolopoulos

TL;DR
This paper introduces data-flow-driven recovery (DFR), a checkpoint-only approach to localised rollback in HPC systems that analyses the data flow of iterative codes to reduce energy consumption, without the MPI-internal modifications and possible full restarts that log-based methods entail.
Contribution
The paper proposes DFR, a data-flow analysis technique for localized rollback, addressing limitations of log-based recovery methods in HPC systems.
Findings
For an MPI stencil code, DFR reduces overall energy consumption by 10-12% during local rollback.
DFR's energy savings increase quadratically with process count for stencil codes.
DFR outperforms global rollback in large-scale HPC scenarios.
Abstract
Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-level checkpoint and a global rollback to recover. In recent years, techniques reducing the number of rolling back processes have been implemented via message logging. However, the log-based approaches have weaknesses, such as being dependent on complex modifications within an MPI implementation, and the fact that a full restart may be required in the general case. To address the limitations of all log-based mechanisms, we return to checkpoint-only mechanisms, but advocate data-flow-driven recovery (DFR), a fundamentally different approach relying on analysis of the data flow of iterative codes, and the well-known concept of data-flow graphs. We demonstrate the effectiveness of DFR for an MPI stencil code to optimise rollback and reduce the overall energy consumption by 10-12 % on idling nodes…
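The idea of deriving the rollback set from a data-flow graph can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a hypothetical 1D periodic stencil in which each rank reads only from its left and right neighbours, that every rank holds a checkpoint from the same iteration, and that surviving ranks keep no message logs. The function names `ring_neighbours` and `ranks_to_roll_back` are invented for illustration.

```python
from collections import deque


def ring_neighbours(rank, nprocs):
    """Hypothetical 1D periodic stencil: a rank reads from its left and right neighbour."""
    return {(rank - 1) % nprocs, (rank + 1) % nprocs}


def ranks_to_roll_back(failed_rank, checkpoint_iter, failure_iter, nprocs,
                       neighbours=ring_neighbours):
    """Walk the unrolled data-flow graph backwards from (failed_rank, failure_iter).

    A node (rank, it) is the state of `rank` after iteration `it`. It depends on
    (rank, it - 1) and on (n, it - 1) for every neighbour n. States at
    `checkpoint_iter` are assumed to be available from checkpoints, so the walk
    stops there. Returns the ranks that must recompute at least one iteration.
    """
    start = (failed_rank, failure_iter)
    seen = {start}
    queue = deque([start])
    recompute = set()
    while queue:
        rank, it = queue.popleft()
        if it <= checkpoint_iter:
            continue  # this state is served directly by the checkpoint
        recompute.add(rank)
        for src in neighbours(rank, nprocs) | {rank}:
            node = (src, it - 1)
            if node not in seen:
                seen.add(node)
                queue.append(node)
    return recompute


if __name__ == "__main__":
    # Rank 8 of 16 fails three iterations after the last checkpoint.
    print(sorted(ranks_to_roll_back(8, checkpoint_iter=10, failure_iter=13, nprocs=16)))
```

Under these assumptions only the ranks inside the backward dependency cone of the failed rank recompute; the remaining ranks can stay idle, which is where a localised rollback could save energy compared with a global one. The cone widens by one hop per iteration since the checkpoint, so a long gap eventually involves every rank.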
Taxonomy
Topics: Distributed systems and fault tolerance · Advanced Data Storage Technologies · Cloud Computing and Resource Management
