Is the Multigrid Method Fault Tolerant? The Multilevel Case

Mark Ainsworth; Christian Glusa

arXiv:1607.08502·math.NA·May 27, 2019

TL;DR

This paper examines whether the multigrid method can tolerate the higher fault rates expected in exascale computing. It shows that the method is not resilient unless the prolongation operation is protected, and it discusses strategies for fault detection and mitigation.

Contribution

The paper extends the authors' earlier two-grid analysis of random stationary linear iterations to the full multigrid algorithm. It identifies the operations that are critical for fault resilience and proposes mitigation strategies, with guidelines for choosing parameters.

Findings

01

The Fault-Prone Multigrid Method is not resilient unless the prolongation operation is protected.

02

Protecting the prolongation operation significantly improves fault resilience.

03

Strategies for fault detection and mitigation are presented for the operations that need to be safeguarded.

Abstract

Computing at the exascale level is expected to be affected by a significantly higher rate of faults, due to increased component counts as well as power considerations. Therefore, current-day numerical algorithms need to be reexamined to determine whether they are fault resilient, and which critical operations need to be safeguarded in order to obtain performance that is close to the ideal fault-free method. In a previous paper, a framework for the analysis of random stationary linear iterations was presented and applied to the two grid method. The present work is concerned with the multigrid algorithm for the solution of linear systems of equations, which is widely used on high performance computing systems. It is shown that the Fault-Prone Multigrid Method is not resilient, unless the prolongation operation is protected. Strategies for fault detection and mitigation as well as…
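To make the setting concrete, here is a minimal sketch of a multigrid V-cycle with injected faults, written for illustration only and not taken from the paper. Everything specific in it is an assumption: the 1D Poisson model problem, the damped Jacobi smoother, the fault model (each entry of an operation's output is zeroed independently with probability `q`), the choice of which outputs are fault-prone, and all function names. The `protect_prolongation` flag switches fault injection off for the prolongation step only.

```python
import numpy as np

def poisson_matrix(n):
    """1D Poisson matrix (Dirichlet) on n interior points, mesh width 1/(n+1)."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def prolongation_matrix(n_coarse):
    """Linear interpolation from n_coarse to 2*n_coarse + 1 interior points."""
    P = np.zeros((2 * n_coarse + 1, n_coarse))
    for j in range(n_coarse):
        i = 2 * j + 1              # fine node that coincides with coarse node j
        P[i, j] = 1.0
        P[i - 1, j] = 0.5
        P[i + 1, j] = 0.5
    return P

def inject_faults(v, q, rng):
    """Assumed fault model: zero each entry of v independently with probability q."""
    if q <= 0.0:
        return v
    return v * (rng.random(v.shape) >= q)

def build_hierarchy(n_fine, n_levels):
    """Galerkin coarse operators R A P with full-weighting restriction R = P^T / 2."""
    A_levels, P_levels, n = [poisson_matrix(n_fine)], [], n_fine
    for _ in range(n_levels - 1):
        n = (n - 1) // 2
        P = prolongation_matrix(n)
        P_levels.append(P)
        A_levels.append(0.5 * P.T @ A_levels[-1] @ P)
    return A_levels, P_levels

def v_cycle(A_levels, P_levels, level, b, x, q, rng, protect_prolongation):
    A = A_levels[level]
    if level == len(A_levels) - 1:
        return np.linalg.solve(A, b)   # coarsest solve, assumed fault-free
    omega, D = 2.0 / 3.0, np.diag(A)
    # Pre-smoothing: one damped Jacobi step with a fault-prone correction.
    x = x + inject_faults(omega * (b - A @ x) / D, q, rng)
    # Fault-prone restriction of the residual.
    P = P_levels[level]
    r_coarse = inject_faults(0.5 * P.T @ (b - A @ x), q, rng)
    e_coarse = v_cycle(A_levels, P_levels, level + 1, r_coarse,
                       np.zeros_like(r_coarse), q, rng, protect_prolongation)
    # Prolongation of the coarse correction, optionally protected from faults.
    e_fine = P @ e_coarse
    if not protect_prolongation:
        e_fine = inject_faults(e_fine, q, rng)
    x = x + e_fine
    # Post-smoothing: one damped Jacobi step with a fault-prone correction.
    return x + inject_faults(omega * (b - A @ x) / D, q, rng)

def residual_history(q, protect_prolongation, n_cycles=10, seed=0):
    """Residual 2-norms before and after each of n_cycles V-cycles."""
    rng = np.random.default_rng(seed)
    A_levels, P_levels = build_hierarchy(63, 4)
    b, x = np.ones(63), np.zeros(63)
    history = [np.linalg.norm(b)]
    for _ in range(n_cycles):
        x = v_cycle(A_levels, P_levels, 0, b, x, q, rng, protect_prolongation)
        history.append(np.linalg.norm(b - A_levels[0] @ x))
    return history
```

Calling `residual_history(0.05, False)` and `residual_history(0.05, True)` and comparing the two lists against the fault-free run `residual_history(0.0, False)` gives a rough, single-sample feel for how protecting the prolongation changes the behaviour under this assumed fault model. The results are random and depend on `seed`, so they illustrate the question the paper analyses rather than reproduce its results.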
