Can Agent Intelligence be used to Achieve Fault Tolerant Parallel Computing Systems?

Blesson Varghese; Gerard McKee; Vassil Alexandrov

arXiv:1308.2872 · cs.DC · August 14, 2013

TL;DR

This paper explores using intelligent agents with cognitive capabilities to enhance fault tolerance in parallel computing systems, potentially offering an alternative to traditional checkpointing methods.

Contribution

The paper introduces an agent-based approach to fault tolerance that leverages cognitive agent capabilities, applied specifically to parallel reduction algorithms implemented with the Message Passing Interface (MPI).
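The core idea can be illustrated with a small single-process simulation. This is a hypothetical sketch, not the authors' MPI implementation: `Node`, `Agent`, `fault_tolerant_reduce` and the `failing` flag are invented names, no MPI is used, and the agent's "cognitive" ability to anticipate a failure is reduced to a precomputed health flag. Each node's data is owned by an agent; agents on nodes flagged as failing hand their data to a healthy node before the reduction runs, so the reduced result is unaffected and no checkpoint is needed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical simulated compute node holding a slice of the input."""
    rank: int
    data: list[float]
    failing: bool = False  # assumed health signal an agent could monitor


@dataclass
class Agent:
    """Hypothetical agent that owns a node's data and can migrate it."""
    home: int
    data: list[float] = field(default_factory=list)

    def partial(self) -> float:
        # Local (per-agent) stage of the reduction.
        return sum(self.data)


def fault_tolerant_reduce(nodes: list[Node]) -> float:
    """Sum all node data, migrating agents away from failing nodes first."""
    # One agent per node, initially carrying that node's slice.
    agents = {n.rank: Agent(n.rank, list(n.data)) for n in nodes}
    healthy = [n.rank for n in nodes if not n.failing]
    if not healthy:
        raise RuntimeError("no healthy node left to host the agents")

    # Agents on nodes that signal an impending failure move their data to
    # the next healthy rank (wrapping around) before the node goes down.
    for n in nodes:
        if n.failing:
            target = next((r for r in healthy if r > n.rank), healthy[0])
            agents[target].data.extend(agents.pop(n.rank).data)

    # Pairwise (tree-style) combination of the surviving partial sums.
    partials = [agents[r].partial() for r in sorted(agents)]
    while len(partials) > 1:
        paired = [partials[i] + partials[i + 1]
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            paired.append(partials[-1])  # odd one out waits for next round
        partials = paired
    return partials[0]
```

For example, with four nodes holding [1, 2], [3, 4], [5] and [6, 7, 8] and node 1 flagged as failing, the call still returns 36.0, because node 1's data is carried to node 2 before the reduction. In a real MPI code the hard part is predicting the failure; the sketch assumes that prediction is already available.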

Findings

01

Preliminary results validate the feasibility of agent-based fault tolerance.

02

Agent capabilities for fault tolerance can be implemented using the Message Passing Interface (MPI).

03

Parallel reduction algorithms are identified as a class of algorithms that can benefit from cognitive agent capabilities.

Abstract

The work reported in this paper is motivated towards validating an alternative approach for fault tolerance over traditional methods like checkpointing that constrain efficacious fault tolerance. Can agent intelligence be used to achieve fault tolerant parallel computing systems? If so, "What agent capabilities are required for fault tolerance?", "What parallel computational tasks can benefit from such agent capabilities?" and "How can agent capabilities be implemented for fault tolerance?" need to be addressed. Cognitive capabilities essential for achieving fault tolerance through agents are considered. Parallel reduction algorithms are identified as a class of algorithms that can benefit from cognitive agent capabilities. The Message Passing Interface is utilized for implementing an intelligent agent based approach. Preliminary results obtained from the experiments validate the…
