Fault Detection Engine in Intelligent Predictive Analytics Platform for DCIM
Bodhisattwa Prasad Majumder, Ayan Sengupta, Sajal Jain, Parikshit Bhaduri

TL;DR
This paper introduces a comprehensive Fault Detection Engine within an intelligent predictive analytics platform for data center infrastructure management, utilizing probabilistic models and machine learning to predict failures, identify root causes, and cluster devices for scalable real-time fault detection.
Contribution
The paper presents a novel architecture integrating failure prediction, root cause analysis, and community detection modules for fault diagnosis in large device networks.
Findings
The engine predicts failure severity and device survival probability.
Root cause analysis accurately identifies potential faulty devices.
Clustering reduces search space for fault localization.
Abstract
With advances in large-scale data generation and data handling, machine learning and probabilistic modelling create an immense opportunity to deploy predictive analytics platforms in security-critical industries such as data centers, electricity grids, utilities, and airports, where minimizing downtime is a primary objective. This paper proposes a novel, complete architecture for an intelligent predictive analytics platform, Fault Engine, for large device networks connected by electrical or information flow. The three proposed modules integrate with the available data-handling technology stack and connect with middleware to produce online, intelligent predictions in critical failure scenarios. The Markov Failure module predicts the severity of a failure along with the survival probability of a device at any given instant. The Root Cause Analysis…
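The abstract names a Markov Failure module that reports a device's survival probability, but this page does not give its formulation. As a rough illustration only, the sketch below treats device health as a discrete-time Markov chain with an absorbing "failed" state and computes the probability that a device has not yet failed after each step. The state names, the transition probabilities, and the function names are all assumptions made for this example; they are not taken from the paper.

```python
# Hypothetical health states, ordered by severity. Not from the paper.
STATES = ["healthy", "degraded", "critical", "failed"]

# Hypothetical one-step transition probabilities. Row i holds the
# probabilities of moving from STATES[i] to each state; every row sums to 1.
# "failed" is absorbing: a failed device stays failed.
TRANSITIONS = [
    [0.90, 0.08, 0.02, 0.00],
    [0.10, 0.70, 0.15, 0.05],
    [0.00, 0.10, 0.60, 0.30],
    [0.00, 0.00, 0.00, 1.00],
]


def step(dist, matrix):
    """Advance a state distribution by one time step (row vector times matrix)."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def survival_curve(start_state, matrix, horizon):
    """Return P(device has not failed) after 0, 1, ..., horizon steps."""
    failed = STATES.index("failed")
    dist = [0.0] * len(STATES)
    dist[STATES.index(start_state)] = 1.0
    curve = [1.0 - dist[failed]]
    for _ in range(horizon):
        dist = step(dist, matrix)
        curve.append(1.0 - dist[failed])
    return curve


if __name__ == "__main__":
    # Survival probabilities over five steps for a device that starts healthy.
    for t, p in enumerate(survival_curve("healthy", TRANSITIONS, 5)):
        print(t, round(p, 4))
```

Because "failed" is absorbing, the curve can only stay flat or fall over time. In practice the transition probabilities would have to be estimated from historical device state logs rather than set by hand.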
Taxonomy
Topics: Software System Performance and Reliability · Network Security and Intrusion Detection · Complex Network Analysis Techniques
