Adaptive control in rollforward recovery for extreme scale multigrid

Markus Huber; Ulrich Rüde; Barbara Wohlmuth

arXiv:1804.06373·cs.MS·April 18, 2018·Int. J. High Perform. Comput. Appl.


TL;DR

This paper introduces an adaptive control mechanism for fault recovery in multigrid methods on exascale supercomputers, optimizing re-coupling after faults to minimize computational waste and ensure efficient parallel performance.

Contribution

The paper extends an existing algorithm-based recovery method for multigrid with an adaptive stopping criterion, based on an error estimator, that decides when the recovered faulty subdomain is re-coupled with the healthy one, improving robustness and efficiency.

Findings

1. Successfully tested on systems with up to 6.9×10^{11} unknowns

2. Achieved fault recovery on over 245,766 parallel processes

3. Demonstrated robustness and efficiency of the adaptive control method

Abstract

With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is re-constructed by an asynchronous on-line recovery. The computations in both the faulty and healthy subdomains must be coordinated in a sensitive way; in particular, both under- and over-solving must be avoided. Both of these waste computational resources and will therefore increase the overall time-to-solution. To control the local recovery and guarantee an optimal re-coupling, we introduce a stopping criterion based on a mathematical error…
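The idea of stopping the local recovery as soon as the faulty subdomain is as accurate as the healthy one can be illustrated with a toy example. The following is only a minimal sketch under loud assumptions: it uses a 1D Poisson problem, Gauss-Seidel sweeps instead of multigrid, a serial rather than asynchronous recovery, and a plain residual norm as a stand-in for the paper's error estimator, whose details are truncated in the abstract above. All function names, the grid size, and the fault location are hypothetical.

```python
# Toy sketch (not the authors' method): -u'' = f on (0, 1), u(0) = u(1) = 0.
# Assumption: a plain residual norm replaces the paper's error estimator.

def residual(u, f, h):
    """Return r = f - A u at interior nodes; boundary entries stay zero."""
    n = len(u)
    r = [0.0] * n
    for i in range(1, n - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def l2(values):
    """Euclidean norm of a list of floats."""
    return sum(v * v for v in values) ** 0.5

def gauss_seidel(u, f, h, lo, hi):
    """One in-place Gauss-Seidel sweep over nodes lo..hi-1."""
    for i in range(lo, hi):
        u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])

def recover(u, f, h, lo, hi, target, max_sweeps=10000):
    """Sweep only inside the faulty range [lo, hi), with the interface
    values u[lo-1] and u[hi] frozen, until the local residual norm no
    longer exceeds the target. Stopping here avoids over-solving; not
    stopping earlier avoids under-solving."""
    sweeps = 0
    while l2(residual(u, f, h)[lo:hi]) > target and sweeps < max_sweeps:
        gauss_seidel(u, f, h, lo, hi)
        sweeps += 1
    return sweeps

n = 65                      # number of grid nodes (hypothetical size)
h = 1.0 / (n - 1)
f = [1.0] * n
u = [0.0] * n

# Fault-free phase: a few global sweeps.
for _ in range(50):
    gauss_seidel(u, f, h, 1, n - 1)

# A fault wipes the values in nodes lo..hi-1.
lo, hi = 24, 40
r = residual(u, f, h)
target = l2(r[1:lo] + r[hi:n - 1])   # accuracy of the healthy part
u[lo:hi] = [0.0] * (hi - lo)

sweeps = recover(u, f, h, lo, hi, target)
local = l2(residual(u, f, h)[lo:hi])
print(sweeps, local <= target)
```

In this sketch the healthy part is frozen during recovery for simplicity, whereas the abstract describes the healthy part continuing its iterations while the faulty domain is rebuilt; the target would then be a moving quantity rather than a fixed number.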
