Is the Multigrid Method Fault Tolerant? The Two-Grid Case
Mark Ainsworth, Christian Glusa

TL;DR
This paper analyzes the fault tolerance of the multigrid method in the two-grid case. It provides a mathematical model for faults and identifies the minimal remedial actions needed to restore convergence on exascale machines.
Contribution
It introduces the first mathematical model for faults in multigrid algorithms and analyzes the two-grid method's resilience, proposing minimal remedial actions for fault recovery.
Findings
Without remedial action, the two-grid method is not resilient to faults.
Minimal remedial actions can restore the convergence rate.
The paper provides a mathematical model for faults in multigrid algorithms.
Abstract
The predicted reduced resiliency of next-generation high performance computers means that it will become necessary to take into account the effects of randomly occurring faults on numerical methods. Further, in the event of a hard fault occurring, a decision has to be made as to what remedial action should be taken in order to resume the execution of the algorithm. The action that is chosen can have a dramatic effect on the performance and characteristics of the scheme. Ideally, the resulting algorithm should be subjected to the same kind of mathematical analysis that was applied to the original, deterministic variant. The purpose of this work is to provide an analysis of the behaviour of the multigrid algorithm in the presence of faults. Multigrid is arguably the method of choice for the solution of large-scale linear algebra problems arising from discretization of partial…
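To make the setting concrete, here is a minimal sketch of a two-grid iteration for the 1D Poisson problem with an injected fault. Everything in it is an illustrative assumption rather than the authors' model or code: the function names, the damped Jacobi smoother, the Galerkin coarse operator, and in particular the fault model, which here simply zeroes each entry of the prolongated coarse-grid correction independently with probability `fault_prob`.

```python
import numpy as np

def poisson_1d(n):
    """Finite-difference matrix for -u'' on (0, 1) with n interior points."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def prolongation(nc):
    """Linear interpolation from nc coarse points to 2*nc + 1 fine points."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        i = 2 * j + 1          # fine index of coarse point j
        P[i, j] = 1.0
        P[i - 1, j] = 0.5
        P[i + 1, j] = 0.5
    return P

def two_grid_step(A, Ac, P, R, b, x, rng, fault_prob=0.0, omega=2.0 / 3.0):
    """One two-grid cycle: smooth, coarse-grid correction, smooth."""
    d = np.diag(A)
    x = x + omega * (b - A @ x) / d               # pre-smoothing (damped Jacobi)
    r = b - A @ x                                 # fine-grid residual
    correction = P @ np.linalg.solve(Ac, R @ r)   # exact coarse solve, prolongated
    # Assumed fault model: each entry of the correction is lost (zeroed)
    # independently with probability fault_prob.
    keep = rng.random(correction.shape[0]) >= fault_prob
    x = x + keep * correction
    x = x + omega * (b - A @ x) / d               # post-smoothing
    return x

def run(fault_prob, n_coarse=31, iterations=10, seed=0):
    """Return the relative residual after a fixed number of two-grid cycles."""
    rng = np.random.default_rng(seed)
    P = prolongation(n_coarse)
    R = 0.5 * P.T                                 # full-weighting restriction
    A = poisson_1d(P.shape[0])
    Ac = R @ A @ P                                # Galerkin coarse operator
    b = np.ones(A.shape[0])
    x = np.zeros_like(b)
    for _ in range(iterations):
        x = two_grid_step(A, Ac, P, R, b, x, rng, fault_prob)
    return np.linalg.norm(b - A @ x) / np.linalg.norm(b)

if __name__ == "__main__":
    print("fault-free relative residual:", run(0.0))
    print("faulty relative residual:    ", run(0.1))
```

Comparing `run(0.0)` with `run(0.1)` gives a rough feel for how lost entries in the correction slow the reduction of the residual in this toy setup. It says nothing quantitative about the paper's results, which rest on a rigorous analysis.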
