Resilient by Design -- Active Inference for Distributed Continuum Intelligence
Praveen Kumar Donta, Alfreds Lapkovskis, Enzo Mingozzi, Schahram Dustdar

TL;DR
This paper proposes PAIR-Agent, an active inference-based framework that improves resilience in distributed computing continuum systems by detecting, managing, and healing faults in real time.
Contribution
It introduces a probabilistic active inference approach for fault detection and autonomous healing in complex distributed systems, combining causal modeling and adaptive reconfiguration.
Findings
The framework effectively detects faults using causal fault graphs.
It manages uncertainties with Markov blankets and free energy principles.
Theoretical validation confirms system reliability and resilience.
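
As a rough illustration of how a Markov blanket limits what must be monitored in a causal fault graph, the sketch below computes the blanket of one node (its parents, its children, and its children's other parents) in a small directed graph. The graph, the node names, and the function markov_blanket are hypothetical and are not taken from the paper; the paper's actual graph construction from device logs is not reproduced here.

```python
from collections import defaultdict


def markov_blanket(edges, node):
    """Return the Markov blanket of `node` in a DAG given as (cause, effect) pairs."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for cause, effect in edges:
        parents[effect].add(cause)
        children[cause].add(effect)

    # Blanket = parents + children + the children's other parents (co-parents).
    blanket = set(parents[node]) | set(children[node])
    for child in children[node]:
        blanket |= parents[child]
    blanket.discard(node)
    return blanket


# Hypothetical causal fault graph: each pair reads "cause -> effect".
edges = [
    ("power_drop", "sensor_timeout"),
    ("network_congestion", "sensor_timeout"),
    ("sensor_timeout", "stale_data"),
    ("cache_failure", "stale_data"),
    ("stale_data", "inference_error"),
]

blanket = markov_blanket(edges, "sensor_timeout")
print(sorted(blanket))
```

Given its blanket, a node is conditionally independent of the rest of the graph, so in this toy graph reasoning about sensor_timeout does not require observing inference_error directly.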
Abstract
Failures are the norm in highly complex and heterogeneous devices spanning the distributed computing continuum (DCC), from resource-constrained IoT and edge nodes to high-performance computing systems. Ensuring reliability and global consistency across these layers remains a major challenge, especially for AI-driven workloads requiring real-time, adaptive coordination. This work-in-progress paper introduces a Probabilistic Active Inference Resilience Agent (PAIR-Agent) to achieve resilience in DCC systems. PAIR-Agent performs three core operations: (i) constructing a causal fault graph from device logs, (ii) identifying faults while managing certainties and uncertainties using Markov blankets and the free energy principle, and (iii) autonomously healing issues through active inference. Through continuous monitoring and adaptive reconfiguration, the agent maintains service continuity and…
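
To make the free energy principle mentioned in step (ii) more concrete, here is a minimal sketch using the standard discrete variational free energy, F = sum over s of q(s) [ln q(s) - ln p(o, s)], for a single device with two hidden states. The states, the probabilities, and the function free_energy are illustrative assumptions, not values or code from the paper, and the paper's own formulation may differ.

```python
import math


def free_energy(q, prior, likelihood_o):
    """Variational free energy F = sum_s q(s) * (ln q(s) - ln p(o, s)),
    where p(o, s) = p(o | s) * p(s) for one fixed observation o."""
    return sum(
        qs * (math.log(qs) - math.log(likelihood_o[s] * prior[s]))
        for s, qs in q.items()
        if qs > 0
    )


# Assumed prior belief over the hidden device state.
prior = {"healthy": 0.9, "faulty": 0.1}
# Assumed likelihood of observing a timeout in each state, p(o = timeout | s).
likelihood_timeout = {"healthy": 0.05, "faulty": 0.8}

# Exact Bayesian posterior, used here as the belief that minimizes F.
evidence = sum(likelihood_timeout[s] * prior[s] for s in prior)
posterior = {s: likelihood_timeout[s] * prior[s] / evidence for s in prior}

f_prior = free_energy(prior, prior, likelihood_timeout)
f_posterior = free_energy(posterior, prior, likelihood_timeout)

print(f"F with belief = prior:     {f_prior:.4f}")
print(f"F with belief = posterior: {f_posterior:.4f}")
print(f"-ln p(o):                  {-math.log(evidence):.4f}")
```

Free energy is lowest when the belief equals the exact posterior, where it reduces to the negative log evidence. An agent that lowers F after seeing a timeout therefore shifts its belief toward the faulty state.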
Taxonomy
Topics: Distributed systems and fault tolerance · Software System Performance and Reliability · IoT and Edge/Fog Computing
