Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance

Giorgis Georgakoudis; Luanzheng Guo; Ignacio Laguna

arXiv:2102.06896·cs.DC·February 16, 2021

Open Access

TL;DR

Reinit++ is a fast, scalable global-restart recovery method for MPI applications. It recovers up to 6x faster than restarting the application and up to 3x faster than ULFM.

Contribution

The paper introduces Reinit++, a new design and implementation of the Reinit approach to global-restart recovery. It avoids application re-deployment, which improves recovery speed and scalability for MPI fault tolerance.

Findings

01. Reinit++ recovers up to 6x faster than restarting.

02. Reinit++ outperforms ULFM by up to 3x in recovery time.

03. Reinit++ scales effectively with increasing MPI processes.

Abstract

Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient by checkpointing data and restarting execution after a failure occurs, resuming from the latest checkpoint. However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limits checkpoint retrieval to slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ against the leading MPI fault-tolerance approach, ULFM, implementing global-restart recovery, and against the typical practice of restarting an application, to derive new insight on performance. Experimentation with three different HPC proxy…

Taxonomy

Topics: Distributed systems and fault tolerance · Advanced Data Storage Technologies · Radiation Effects in Electronics