An Efficient Fault Tolerant Workflow Scheduling Approach using Replication Heuristics and Checkpointing in the Cloud

S. Jaya Nirmala; Amrith Rajagopal Setlur; Har Simrat Singh; Sudhanshu; Khoriya

arXiv:1810.06361·cs.DC·November 4, 2019

TL;DR

This paper introduces a novel fault-tolerant workflow scheduling method in cloud environments that uses learned replication heuristics and checkpointing to enhance reliability and resource efficiency.

Contribution

It proposes an unsupervised learning-based replication heuristic combined with lightweight checkpointing for improved fault tolerance in cloud workflow scheduling.

Findings

01 Reduces resource wastage compared to Replicate-All

02 Maintains acceptable makespan increase over HEFT

03 Enhances workflow robustness in failure-prone environments

Abstract

Scientific workflows are widely used for complex, large-scale data analysis and for scientific computation and automation, and the need for robust workflow scheduling techniques has grown considerably. However, most existing workflow scheduling algorithms do not provide the required reliability and robustness. This paper proposes a new fault-tolerant workflow scheduling algorithm that learns replication heuristics in an unsupervised manner. Furthermore, lightweight synchronized checkpointing enables efficient resubmission of failed tasks and ensures workflow completion even in precarious environments. The proposed technique improves on metrics such as Resource Wastage and Resource Usage in comparison to the Replicate-All algorithm, while maintaining an acceptable increase in Makespan compared to vanilla Heterogeneous Earliest Finish Time (HEFT).
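To illustrate why checkpointing can make resubmission of failed tasks cheaper, the following is a minimal sketch, not the paper's algorithm. It counts the units of work executed to finish one task when each failed attempt either restarts from scratch or resumes from the last checkpoint. The function name `work_executed`, the fixed `checkpoint_interval`, and the list of failure points are all hypothetical, and the cost of writing a checkpoint is ignored.

```python
def work_executed(total_units, failures, checkpoint_interval=None):
    """Units of work executed to complete a task of `total_units`.

    failures: progress positions (in work units) at which successive
        attempts fail. Illustrative input, not taken from the paper.
    checkpoint_interval: if given, progress is saved every this many
        units and a failed attempt resumes from the last saved point;
        if None, every failed attempt restarts from zero.
    """
    executed = 0
    start = 0  # progress position from which the current attempt begins
    for fail_at in failures:
        if not start <= fail_at <= total_units:
            raise ValueError("failure point outside the current attempt")
        executed += fail_at - start  # work done before this failure
        if checkpoint_interval:
            # Resume from the most recent checkpoint at or before the failure.
            start = (fail_at // checkpoint_interval) * checkpoint_interval
        else:
            start = 0  # no checkpoint: all progress is lost
    executed += total_units - start  # final attempt runs to completion
    return executed


if __name__ == "__main__":
    print(work_executed(100, [70, 90]))                          # restart from scratch
    print(work_executed(100, [70, 90], checkpoint_interval=25))  # resume from checkpoint
```

For a 100-unit task that fails at progress 70 and then at 90, this model executes 260 units without checkpointing and 135 units with a checkpoint every 25 units. The difference is re-executed work, which is one informal way to think about wasted resources; the paper's own metric definitions may differ.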
