Toward Resilient Algorithms and Applications
Michael A. Heroux

TL;DR
This paper discusses four approaches to developing algorithms that are resilient to hard and soft failures, motivated by the high performance computing community's concern that preserving the reliable, digital machine model will become too costly or infeasible.
Contribution
It discusses four approaches to developing algorithms that are resilient to hard and soft failures in high performance computing environments.
Findings
Proposes four approaches for resilient algorithm development
Addresses the concern that preserving the reliable, digital machine model will become too costly or infeasible
Contributes to sustainable high performance computing
Abstract
Over the past decade, the high performance computing community has become increasingly concerned that preserving the reliable, digital machine model will become too costly or infeasible. In this paper we discuss four approaches for developing new algorithms that are resilient to hard and soft failures.
Taxonomy
Topics: Distributed systems and fault tolerance · Radiation Effects in Electronics · Interconnection Networks and Systems
