An Efficient Fault Tolerant Workflow Scheduling Approach using Replication Heuristics and Checkpointing in the Cloud
S. Jaya Nirmala, Amrith Rajagopal Setlur, Har Simrat Singh, Sudhanshu Khoriya

TL;DR
This paper introduces a novel fault-tolerant workflow scheduling method in cloud environments that uses learned replication heuristics and checkpointing to enhance reliability and resource efficiency.
Contribution
It proposes an unsupervised learning-based replication heuristic combined with lightweight checkpointing for improved fault tolerance in cloud workflow scheduling.
Findings
Reduces resource wastage compared to Replicate-All
Maintains acceptable makespan increase over HEFT
Enhances workflow robustness in failure-prone environments
Abstract
Scientific workflows are widely used for complex, large-scale data analysis and for scientific computation and automation, and the need for robust workflow scheduling techniques has grown considerably. However, most existing workflow scheduling algorithms do not provide the required reliability and robustness. This paper proposes a new fault-tolerant workflow scheduling algorithm that learns replication heuristics in an unsupervised manner. Furthermore, lightweight synchronized checkpointing enables efficient resubmission of failed tasks and ensures workflow completion even in precarious environments. The proposed technique improves on metrics such as Resource Wastage and Resource Usage in comparison to the Replicate-All algorithm, while maintaining an acceptable increase in Makespan compared to vanilla Heterogeneous Earliest Finish Time (HEFT).
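To make the checkpointing idea concrete, the following is a minimal sketch of how resuming a failed task from its last checkpoint reduces re-executed work. It is not the paper's algorithm: the function name `simulate_task`, the failure trace, the fixed checkpoint interval, and the definition of wasted work as "units executed beyond the task's actual size" are all assumptions made for illustration, and checkpoint overhead is ignored.

```python
def simulate_task(total_work, failures, checkpoint_interval=None):
    """Simulate one task that is resubmitted after each failure.

    total_work: work units the task needs to complete (hypothetical unit).
    failures: work units each attempt runs before failing (assumed trace).
    checkpoint_interval: units between checkpoints; None disables checkpointing.
    Returns (executed, wasted), where wasted is the re-executed work.
    """
    saved = 0      # progress that survives a failure
    executed = 0   # all work units run, including repeated ones
    for run in failures:
        reached = saved + run
        if reached >= total_work:
            # The attempt completes before the failure would have occurred.
            executed += total_work - saved
            return executed, executed - total_work
        executed += run
        if checkpoint_interval:
            # Roll back only to the most recent checkpoint boundary.
            saved = (reached // checkpoint_interval) * checkpoint_interval
        else:
            # Without checkpoints, the resubmitted task starts from scratch.
            saved = 0
    # No failures remain: the final attempt finishes the task.
    executed += total_work - saved
    return executed, executed - total_work


with_ckpt = simulate_task(100, [45, 30], checkpoint_interval=20)
without_ckpt = simulate_task(100, [45, 30])
print(with_ckpt, without_ckpt)
```

In this toy trace, the checkpointed run repeats only the work done since the last checkpoint after each failure, while the run without checkpoints repeats everything. A real scheduler would also have to account for the cost of writing and restoring checkpoints.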
