Resilience for Exascale Enabled Multigrid Methods
Markus Huber, Björn Gmeiner, Ulrich Rüde, Barbara Wohlmuth

TL;DR
This paper explores algorithm-based fault tolerance for multigrid methods on exascale supercomputers, proposing local recovery strategies and a "superman" approach to handle faults efficiently.
Contribution
It introduces novel fault recovery strategies for multigrid solvers, including local subproblem solutions and a superman approach to reduce downtime.
Findings
Local recovery effectively mitigates fault impacts.
Superman strategy reduces recovery time significantly.
Fault-tolerant multigrid methods are feasible for exascale computing.
Abstract
With the increasing number of components and further miniaturization, the mean time between faults in supercomputers will decrease. System-level fault tolerance techniques are expensive and cost energy, since they are often based on redundancy. Classical checkpoint-restart techniques also reach their limits when the time for storing the system state to backup memory becomes excessive. Therefore, algorithm-based fault tolerance mechanisms can become an attractive alternative. This article investigates the solution process for elliptic partial differential equations that are discretized by finite elements. Faults that occur in the parallel geometric multigrid solver are studied in various model scenarios. In a standard domain partitioning approach, the failure of a core or a node will affect one or several subdomains. Different strategies are developed to compensate the effect…
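The idea of recovering a lost subdomain by solving a local subproblem can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a 1D Poisson problem with homogeneous Dirichlet boundary conditions, a finite-difference stencil in place of finite elements, a dense direct solve standing in for the multigrid solver, and a fault that hits an already converged state. The names poisson_matrix and recover_local, the grid size, and the faulty index range are hypothetical choices made for illustration.

```python
import numpy as np


def poisson_matrix(n, h):
    """Tridiagonal matrix for -u'' on n interior points with spacing h."""
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2


def recover_local(u, f, h, lo, hi):
    """Recompute u[lo:hi] from a local Dirichlet problem.

    Boundary data are taken from the intact neighbours u[lo-1] and u[hi]
    (zero if the faulty range touches the domain boundary).
    """
    a_loc = poisson_matrix(hi - lo, h)
    rhs = f[lo:hi].copy()
    left = u[lo - 1] if lo > 0 else 0.0
    right = u[hi] if hi < len(u) else 0.0
    # Move the known neighbour values to the right-hand side.
    rhs[0] += left / h**2
    rhs[-1] += right / h**2
    u_rec = u.copy()
    u_rec[lo:hi] = np.linalg.solve(a_loc, rhs)
    return u_rec


# Global problem: -u'' = pi^2 sin(pi x) on (0, 1), u(0) = u(1) = 0.
n = 63
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)
f = np.pi**2 * np.sin(np.pi * x)
u_ref = np.linalg.solve(poisson_matrix(n, h), f)  # stand-in for multigrid

# Simulated fault: the values owned by one "subdomain" are lost.
lo, hi = 20, 36
u_faulty = u_ref.copy()
u_faulty[lo:hi] = 0.0

u_rec = recover_local(u_faulty, f, h, lo, hi)
err_fault = np.max(np.abs(u_faulty - u_ref))
err_rec = np.max(np.abs(u_rec - u_ref))
print(f"error after fault: {err_fault:.3e}, after local recovery: {err_rec:.3e}")
```

Because the neighbouring values in this sketch are already converged, the local solve reproduces the lost values up to rounding. If a fault occurs in the middle of an iteration, the neighbouring data are only approximate, so the recovered values would be approximate as well and the global iteration would need to continue afterwards.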
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Distributed systems and fault tolerance
