Toward fault-tolerant parallel-in-time integration with PFASST
Robert Speck, Daniel Ruprecht

TL;DR
This paper explores strategies for making the parallel-in-time integration method PFASST tolerant of hard faults. It uses PFASST's multi-level hierarchy to recover from processor failures with little overhead and assesses the efficiency of the different strategies.
Contribution
It introduces and analyzes fault recovery strategies for PFASST, leveraging its multi-level structure to minimize overhead and improve resilience against processor failures.
Findings
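Correcting the reconstructed solution on the coarse level can reduce recovery overhead.
A theoretical model links overhead to the number of additional PFASST iterations needed for convergence after a fault.
The strategies are assessed on examples of diffusive and advective type.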
Abstract
We introduce and analyze different strategies for the parallel-in-time integration method PFASST to recover from hard faults and subsequent data loss. Since PFASST stores solutions at multiple time steps on different processors, information from adjacent steps can be used to recover after a processor has failed. PFASST's multi-level hierarchy allows the coarse level to be used to correct the reconstructed solution, which can help to minimize overhead. A theoretical model is devised linking overhead to the number of additional PFASST iterations required for convergence after a fault. The potential efficiency of different strategies is assessed in terms of required additional iterations for examples of diffusive and advective type.
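The idea of measuring recovery cost in additional iterations can be illustrated with a small toy experiment. The sketch below is not PFASST and not the authors' code: it swaps in the simpler Parareal iteration on the scalar test equation u' = LAM*u, simulates the loss of one time step's data after a given iteration, and compares two hypothetical recovery choices, resetting the lost value to zero versus rebuilding it from the left neighbor with the coarse propagator. All names (`run`, `fine`, `coarse`, `recovery`) and all parameter values are assumptions made for illustration.

```python
import math

LAM, DT, N = -1.0, 0.5, 8  # test equation u' = LAM*u, N time steps of size DT


def fine(u):
    """Toy 'fine' propagator: exact solution over one time step."""
    return math.exp(LAM * DT) * u


def coarse(u):
    """Toy 'coarse' propagator: a single implicit Euler step."""
    return u / (1.0 - LAM * DT)


def run(fault_iter=None, fault_step=None, recovery=None, tol=1e-6, max_iter=50):
    """Parareal iteration with an optional simulated data loss.

    After iteration `fault_iter`, the value at time index `fault_step` is
    treated as lost and rebuilt according to `recovery`:
      "zero"   -> reset to zero (no information reused)
      "coarse" -> coarse propagation of the left neighbor's current value
    Returns (number of iterations until the residual drops below tol, solution).
    """
    # Initial guess: serial sweep with the coarse propagator.
    u = [1.0]
    for n in range(N):
        u.append(coarse(u[n]))

    for k in range(1, max_iter + 1):
        # Parareal update: new coarse prediction plus fine-minus-coarse correction.
        new = [1.0]
        for n in range(N):
            new.append(coarse(new[n]) + fine(u[n]) - coarse(u[n]))
        u = new

        # Simulated hard fault: the data of one time step is lost and rebuilt.
        if k == fault_iter:
            if recovery == "zero":
                u[fault_step] = 0.0
            elif recovery == "coarse":
                u[fault_step] = coarse(u[fault_step - 1])

        # Converged once every step matches the fine propagation of its left neighbor.
        residual = max(abs(u[n + 1] - fine(u[n])) for n in range(N))
        if residual < tol:
            return k, u
    raise RuntimeError("no convergence within max_iter iterations")


k_ref, u_ref = run()                         # fault-free reference
k_zero, u_zero = run(2, 4, "zero")           # fault after iteration 2 at step 4
k_coarse, u_coarse = run(2, 4, "coarse")     # same fault, coarse-level rebuild
```

The differences `k_zero - k_ref` and `k_coarse - k_ref` are the additional iterations caused by the fault under each recovery choice. If the cost per iteration is assumed constant and the cost of the recovery step itself is neglected, dividing them by `k_ref` gives a rough relative overhead. The model in the paper may account for further terms.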
