MATCH: An MPI Fault Tolerance Benchmark Suite
Luanzheng Guo, Giorgis Georgakoudis, Konstantinos Parasyris, Ignacio Laguna, Dong Li

TL;DR
MATCH is a comprehensive benchmark suite designed to evaluate and compare various MPI fault tolerance techniques, providing insights into their performance and scalability for high-performance computing applications.
Contribution
The paper introduces MATCH, the first structured benchmark suite for systematically studying and comparing MPI fault tolerance methods across different scenarios.
Findings
Reinit recovery generally outperforms ULFM recovery.
Reinit recovery is scalable and problem-size independent.
Combining Reinit recovery with FTI checkpointing yields high efficiency.
Abstract
MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques has been proposed to enable MPI application execution to efficiently resume from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, so as to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance…
Taxonomy
Topics: Distributed systems and fault tolerance · Advanced Data Storage Technologies · Cloud Computing and Resource Management
