Checkpointing to minimize completion time for Inter-dependent Parallel Processes on Volunteer Grids

Mohammad Tanvir Rahman; Hien Nguyen; Jaspal Subhlok; Gopal Pandurangan

arXiv:1603.03502·cs.DC·March 14, 2016

Open Access

TL;DR

This paper develops a mathematical model for choosing the checkpoint interval that minimizes completion time for inter-dependent parallel processes in a volunteer computing environment, and validates it with a sample real-world application running on distributed volunteer nodes.

Contribution

The paper introduces a mathematical model for determining a suitable checkpoint interval for inter-dependent, replicated parallel processes on volunteer grids, with the goal of minimizing completion time.

Findings

01 Predicted checkpoint intervals closely match empirically optimal ones.

02 The model effectively minimizes process completion time.

03 Validation with a sample real-world application supports the model's accuracy.

Abstract

Volunteer computing is being used successfully for large-scale scientific computations. This research is in the context of Volpex, a programming framework that supports communicating parallel processes in a volunteer environment. Redundancy and checkpointing are combined to ensure consistent forward progress with Volpex in this unique execution environment, which is characterized by heterogeneous, failure-prone nodes and interdependent replicated processes. An important parameter for optimizing performance with Volpex is the frequency of checkpointing. The paper presents a mathematical model to minimize the completion time for inter-dependent parallel processes running in a volunteer environment by finding a suitable checkpoint interval. Validation is performed with a sample real-world application running on a pool of distributed volunteer nodes. The results indicate that the performance with our…
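
The trade-off behind choosing a checkpoint interval can be illustrated with a small sketch. This is not the model from the paper, whose details do not appear in the abstract. It instead uses Young's classic first-order approximation for a single, non-replicated process, where the interval is roughly sqrt(2 * C * MTBF). The function names, the parameters `checkpoint_cost` and `mtbf`, and the sample values are all assumptions made for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    checkpoint_cost: time to write one checkpoint (assumed constant).
    mtbf: mean time between failures of the node (assumed exponential failures).
    Both are hypothetical inputs, in the same time unit.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order overhead per unit of useful work.

    checkpoint_cost / interval is the cost of writing checkpoints;
    interval / (2 * mtbf) is the expected rework after a failure.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost, mtbf = 2.0, 100.0  # illustrative values only, not from the paper
    best = young_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        print(interval, overhead_fraction(interval, cost, mtbf))
```

Checkpointing too often wastes time writing state, while checkpointing too rarely loses more work on each failure; the approximation balances the two terms. A model for inter-dependent replicated processes, as targeted by the paper, would additionally have to account for replicas and communication dependencies, which this sketch ignores.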

Taxonomy

Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Distributed systems and fault tolerance