Adaptive control in rollforward recovery for extreme scale multigrid
Markus Huber, Ulrich Rüde, Barbara Wohlmuth

TL;DR
This paper introduces an adaptive control mechanism for fault recovery in multigrid solvers aimed at future exascale supercomputers. It decides when a recovered faulty subdomain is re-coupled to the healthy part of the computation, so that neither under-solving nor over-solving wastes computational resources or lengthens time-to-solution.
Contribution
It extends existing algorithm-based recovery for multigrid by integrating an adaptive, error-estimator-based stopping criterion for fault re-coupling, improving robustness and efficiency.
Findings
Tested on linear systems with up to 6.9 × 10^11 unknowns
Achieved fault recovery on over 245,766 parallel processes
Demonstrated robustness and efficiency of the adaptive control method
Abstract
With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is re-constructed by an asynchronous on-line recovery. The computations in both the faulty and healthy subdomains must be coordinated in a sensitive way, in particular, both under and over-solving must be avoided. Both of these waste computational resources and will therefore increase the overall time-to-solution. To control the local recovery and guarantee an optimal re-coupling, we introduce a stopping criterion based on a mathematical error…
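The recovery loop described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: a 1D Poisson problem, Gauss-Seidel sweeps standing in for multigrid cycles, frozen interface values that decouple the two sides, and a plain root-mean-square residual standing in for the paper's error estimator (whose details are cut off above). The names `sweep`, `recover`, and `local_sweeps` are hypothetical.

```python
import math

def residual(u, f, h):
    """Pointwise residual f - A u for -u'' = f on a uniform grid (zero at both boundary nodes)."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def rms(values):
    """Root mean square of a non-empty list."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def sweep(u, f, h, lo, hi):
    """One in-place Gauss-Seidel sweep over nodes lo..hi-1 (assumed stand-in for a multigrid cycle)."""
    for i in range(lo, hi):
        u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])

def recover(u, f, h, lo, hi, local_sweeps=5, max_steps=1000):
    """Rebuild the lost block lo..hi-1; return the number of steps until re-coupling.

    Interface nodes lo-1 and hi stay frozen, so the faulty and healthy sides
    solve decoupled Dirichlet problems while the recovery runs.
    """
    n = len(u)
    for step in range(1, max_steps + 1):
        for _ in range(local_sweeps):      # faulty side: several cheap local sweeps
            sweep(u, f, h, lo, hi)
        sweep(u, f, h, 1, lo - 1)          # healthy side keeps iterating
        sweep(u, f, h, hi + 1, n - 1)
        r = residual(u, f, h)
        faulty = rms(r[lo:hi])
        healthy = rms(r[1:lo - 1] + r[hi + 1:n - 1])
        # Assumed stopping rule: re-couple once the recovered block is as
        # accurate as the rest. Stopping earlier would under-solve; further
        # local sweeps would over-solve.
        if faulty <= healthy:
            return step
    return max_steps

# Toy run: iterate, lose a block of eight unknowns, then recover it.
n = 65
h = 1.0 / (n - 1)
f = [1.0] * n
u = [0.0] * n
for _ in range(200):
    sweep(u, f, h, 1, n - 1)
lo, hi = 28, 36
for i in range(lo, hi):
    u[i] = 0.0                             # simulated fault: block values are lost
steps = recover(u, f, h, lo, hi)
print("re-coupled after", steps, "steps")
```

The point of the sketch is the adaptive stopping rule: the number of local recovery sweeps is not fixed in advance but follows from comparing an accuracy measure on the faulty block with the same measure on the healthy part, which keeps improving in the meantime.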
