Can Agent Intelligence be used to Achieve Fault Tolerant Parallel Computing Systems?
Blesson Varghese, Gerard McKee, Vassil Alexandrov

TL;DR
This paper explores using intelligent agents with cognitive capabilities to enhance fault tolerance in parallel computing systems, potentially offering an alternative to traditional checkpointing methods.
Contribution
It introduces an agent-based approach leveraging cognitive capabilities for fault tolerance, specifically applied to parallel reduction algorithms using MPI.
Findings
Preliminary results validate the feasibility of agent-based fault tolerance.
Agent capabilities can be effectively implemented for fault tolerance.
Parallel reduction algorithms benefit from cognitive agent integration.
Abstract
This paper sets out to validate an alternative to traditional fault tolerance methods such as checkpointing, which limit how effective fault tolerance can be. Can agent intelligence be used to achieve fault tolerant parallel computing systems? If so, three questions need to be addressed: "What agent capabilities are required for fault tolerance?", "What parallel computational tasks can benefit from such agent capabilities?" and "How can agent capabilities be implemented for fault tolerance?" The cognitive capabilities essential for achieving fault tolerance through agents are considered. Parallel reduction algorithms are identified as a class of algorithms that can benefit from cognitive agent capabilities. The Message Passing Interface is used to implement an intelligent agent based approach. Preliminary results obtained from the experiments validate the…
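For readers unfamiliar with the class of algorithms the abstract names, the following is a minimal sketch of a pairwise (binary-tree) reduction, simulated sequentially in plain Python. It is not the authors' implementation and uses no MPI; the function name `tree_reduce` and its return shape are assumptions made for illustration. Each round combines neighbouring partial results, as paired processes would in a parallel setting.

```python
from operator import add


def tree_reduce(values, op=add):
    """Simulate a binary-tree reduction (hypothetical helper, not from the paper).

    Returns the reduced value and the number of combining rounds.
    """
    if not values:
        raise ValueError("need at least one value")
    partial = list(values)
    rounds = 0
    while len(partial) > 1:
        nxt = []
        # Combine neighbouring pairs; in a parallel run each pair
        # would be handled by a different process.
        for i in range(0, len(partial) - 1, 2):
            nxt.append(op(partial[i], partial[i + 1]))
        # An unpaired last element is carried into the next round unchanged.
        if len(partial) % 2 == 1:
            nxt.append(partial[-1])
        partial = nxt
        rounds += 1
    return partial[0], rounds


total, rounds = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])
print(total, rounds)
```

In a distributed version of this pattern, every intermediate value depends on all the partial results combined beneath it, which is one reason a single failed process can invalidate the whole computation.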
