A Fault Tolerance Mechanism for Hybrid Scientific Workflows
Alberto Mulone, Doriana Medić, Marco Aldinucci

TL;DR
This paper introduces a fault tolerance mechanism for hybrid scientific workflows in distributed systems, enhancing reliability by implementing recovery and rollback strategies to handle frequent failures.
Contribution
It presents a novel fault tolerance approach tailored for hybrid workflows, addressing challenges posed by heterogeneous and distributed environments.
Findings
The mechanism effectively recovers from failures in hybrid workflows.
Experimental results demonstrate improved workflow reliability.
The approach supports heterogeneous and independent execution environments.
Abstract
In large distributed systems, failures are a daily occurrence, especially as the number of computation tasks and of the locations on which they are deployed grows. Representing an application as a workflow makes it possible to exploit Workflow Management System (WMS) features such as portability. Another relevant feature that some WMSs provide is reliability. In recent years, the emergence of hybrid workflows has posed new challenges by making it possible to distribute computations across heterogeneous and independent environments. Consequently, the number of possible points of failure during execution has increased, raising further challenges that merit study. This paper presents the implementation of a fault tolerance mechanism for hybrid workflows based on the recovery and rollback approach. A…
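The general idea of recovery with rollback can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `StepFailure`, `execute`, `steps`, `deps`, and `store` are hypothetical, and the sketch assumes that a failing step can report which upstream data was lost. When a step fails, the lost data is invalidated, the steps that produced it are re-executed (rollback), and the failed step is retried (recovery).

```python
class StepFailure(Exception):
    """Hypothetical failure signal that names the upstream data that was lost."""

    def __init__(self, lost=()):
        super().__init__("step failed")
        self.lost = tuple(lost)


def execute(step, steps, deps, store, retries=2):
    """Run `step`, re-executing producers of any missing input first."""
    for _ in range(retries + 1):
        # Rollback: recompute every input that is not available in the store.
        for dep in deps[step]:
            if dep not in store:
                execute(dep, steps, deps, store, retries)
        try:
            inputs = [store[dep] for dep in deps[step]]
            store[step] = steps[step](*inputs)
            return store[step]
        except StepFailure as failure:
            # Recovery: invalidate the data reported as lost, then retry.
            for name in failure.lost:
                store.pop(name, None)
    raise RuntimeError(f"step {step!r} failed after {retries} retries")


# Toy workflow a -> b -> c; step c fails once and reports that b's output is lost.
calls = []
state = {"c_failed": False}


def step_a():
    calls.append("a")
    return 2


def step_b(a):
    calls.append("b")
    return a + 1


def step_c(b):
    calls.append("c")
    if not state["c_failed"]:
        state["c_failed"] = True
        raise StepFailure(lost=["b"])
    return b * 10


steps = {"a": step_a, "b": step_b, "c": step_c}
deps = {"a": [], "b": ["a"], "c": ["b"]}
store = {}
result = execute("c", steps, deps, store)
```

In this toy run, only step `b` is re-executed after the failure, because the output of `a` is still available. A real WMS would also have to decide where the re-executed steps run and how data is moved between locations, which the sketch leaves out.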
Taxonomy
Topics: Scientific Computing and Data Management · Research Data Management Practices
