Determination of Checkpointing Intervals for Malleable Applications

K. Raghavendra; Sathish S Vadhiyar

arXiv:1711.00270 · cs.DC · November 2, 2017


TL;DR

This paper develops a performance model to determine optimal checkpointing intervals for malleable parallel applications, enhancing efficiency and resilience in the presence of system failures.

Contribution

It introduces a novel performance model specifically for malleable applications that accounts for changing processor counts during execution.

Findings

01. Model-based checkpointing intervals improve application efficiency.

02. Simulations show high efficiency with the proposed intervals.

03. Applicable to real supercomputing system traces.

Abstract

Selecting optimal checkpointing intervals for an application is important for minimizing its run time in the presence of system failures. Most existing work on checkpointing interval selection targets sequential applications, while a few efforts deal with parallel applications that execute on the same number of processors for their entire duration. Some checkpointing systems support parallel applications whose processor count can change during execution. We refer to these as malleable applications. In this paper, we develop a performance model for malleable parallel applications that estimates the amount of useful work performed in unit time (UWT) by a malleable application in the presence of failures as a function of checkpointing…
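To make the trade-off behind interval selection concrete, here is a minimal sketch of the classical first-order approximation (Young's formula) for an application running on a fixed number of processors. This is not the paper's malleable-application model, which the abstract does not spell out. The function names `young_interval` and `useful_fraction`, and the sample values for checkpoint cost and mean time between failures (MTBF), are assumptions chosen for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order estimate of the checkpoint interval.

    Both arguments are in the same time unit (seconds here).
    Assumes a fixed processor count and checkpoint_cost << mtbf.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def useful_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of wall-clock time spent on useful work.

    Waste has two parts: the overhead of writing a checkpoint once per
    interval, and the expected rework after a failure, which on average
    loses half an interval once per MTBF.
    """
    waste = checkpoint_cost / interval + interval / (2.0 * mtbf)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    # Hypothetical values: 60 s per checkpoint, one failure per day.
    cost, mtbf = 60.0, 24 * 3600.0
    best = young_interval(cost, mtbf)
    for tau in (best / 4, best / 2, best, best * 2, best * 4):
        print(f"interval={tau:9.1f} s  useful={useful_fraction(tau, cost, mtbf):.4f}")
```

Checkpointing too often wastes time writing checkpoints, while checkpointing too rarely wastes time redoing lost work, so the useful fraction peaks at an intermediate interval. A model for malleable applications would additionally have to account for the processor count changing during execution, which this sketch ignores.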


Taxonomy

Topics: Distributed systems and fault tolerance · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques