Resiliency in Numerical Algorithm Design for Extreme Scale Simulations
Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer

TL;DR
This paper discusses the challenges of ensuring resilience in numerical algorithms for exascale simulations, emphasizing the need for innovative approaches beyond traditional checkpointing due to scale-related failures.
Contribution
It highlights the importance of combining advanced system features with application-specific knowledge to develop resilient algorithms for extreme-scale computing.
Findings
Checkpointing overheads are prohibitive at exascale.
Fault detection and response require application-aware strategies.
Novel numerical and stochastic algorithms may enhance resilience.
Abstract
This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations", held March 1-6, 2020 at Schloss Dagstuhl, which was attended by all the authors. Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will…
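The claim that a run may make no progress once the mean time between failures (MTBF) approaches the checkpoint and recovery time can be illustrated with a textbook first-order model. The sketch below is not taken from the paper. It assumes Young's approximation for the checkpoint interval, tau = sqrt(2 * C * M), and a simple waste estimate, C / tau + (tau / 2 + R) / M, where C is the checkpoint cost, R the recovery cost, and M the MTBF. The function names and all numbers are hypothetical and chosen only for illustration. The model is meaningful only when C is much smaller than M, so values at or above 1 should be read as "no forward progress" rather than as precise predictions.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def waste_fraction(interval, checkpoint_cost, recovery_cost, mtbf):
    """First-order estimate of the fraction of wall-clock time lost.

    checkpoint_cost / interval: share of time spent writing checkpoints.
    (interval / 2 + recovery_cost) / mtbf: expected rework plus recovery
    per failure, spread over the mean time between failures.
    The result is capped at 1.0, meaning no forward progress.
    """
    waste = checkpoint_cost / interval + (interval / 2.0 + recovery_cost) / mtbf
    return min(waste, 1.0)


if __name__ == "__main__":
    # Hypothetical values in seconds: a 30-minute checkpoint and recovery,
    # evaluated for an MTBF of one day, one hour, and 15 minutes.
    cost = 1800.0
    for mtbf in (86400.0, 3600.0, 900.0):
        tau = young_interval(cost, mtbf)
        waste = waste_fraction(tau, cost, cost, mtbf)
        print(f"MTBF={mtbf:8.0f}s  interval={tau:8.0f}s  waste={waste:.2f}")
```

Under these assumed numbers, the estimated waste stays moderate while the MTBF is far larger than the checkpoint cost and saturates once the two become comparable, which is the regime the abstract warns about.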
Taxonomy
Topics: Distributed systems and fault tolerance · Radiation Effects in Electronics · Advanced Data Storage Technologies
