Cloud-Based AI Systems: Leveraging Large Language Models for Intelligent Fault Detection and Autonomous Self-Healing
Cheng Ji, Huaiying Luo

TL;DR
This paper introduces a novel AI framework utilizing Large Language Models for real-time fault detection and autonomous self-healing in cloud systems, improving accuracy and reducing downtime compared to traditional methods.
Contribution
It presents a new multi-level AI architecture combining LLMs with machine learning for proactive fault detection and self-healing in complex cloud environments.
Findings
Enhanced fault detection accuracy
Reduced system downtime
Faster recovery times
Abstract
With the rapid development of cloud computing systems and the growing complexity of their infrastructure, intelligent mechanisms that detect and mitigate failures in real time are becoming increasingly important. Traditional failure detection methods often struggle to cope with the scale and dynamics of modern cloud environments. In this study, we propose a novel AI framework based on a Large Language Model (LLM) for intelligent fault detection and self-healing in cloud systems. The model combines existing machine learning fault detection algorithms with the LLM's natural language understanding capabilities to process and parse system logs, error reports, and real-time data streams using semantic context. The method adopts a multi-level architecture that combines supervised learning for fault classification with unsupervised learning for anomaly detection, so that the…
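The two-level idea in the abstract (an unsupervised anomaly check followed by supervised fault classification of log text) can be illustrated with a minimal sketch. This is not the authors' implementation: the z-score gate stands in for the unsupervised detector, a toy word-overlap classifier stands in for the supervised and LLM components, and every name, label, and log line below (`TRAINING_LOGS`, `detect_fault`, the `storage`/`network`/`memory` classes) is hypothetical.

```python
from statistics import mean, stdev

# Hypothetical labelled log lines; a real system would train on far more data.
TRAINING_LOGS = [
    ("disk /dev/sda1 is full, write failed", "storage"),
    ("no space left on device", "storage"),
    ("connection timed out to upstream host", "network"),
    ("network unreachable, packet loss detected", "network"),
    ("out of memory: killed process", "memory"),
    ("memory allocation failed for heap", "memory"),
]


def train_keyword_model(examples):
    """Count word occurrences per label (a toy supervised classifier)."""
    model = {}
    for text, label in examples:
        counts = model.setdefault(label, {})
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return model


def classify(model, log_line):
    """Return the label whose training vocabulary overlaps most with the line."""
    words = log_line.lower().split()
    scores = {
        label: sum(counts.get(word, 0) for word in words)
        for label, counts in model.items()
    }
    return max(scores, key=scores.get)


def is_anomalous(history, value, threshold=3.0):
    """Unsupervised gate: flag values far from the baseline mean (z-score)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def detect_fault(model, history, value, log_line):
    """Two-level pipeline: anomaly gate first, then fault classification."""
    if not is_anomalous(history, value):
        return None  # metric looks normal, so no fault is reported
    return classify(model, log_line)


model = train_keyword_model(TRAINING_LOGS)
latency_history_ms = [100, 102, 98, 101, 99]  # assumed baseline samples
fault = detect_fault(model, latency_history_ms, 250,
                     "write failed: no space left on device")
print(fault)
```

Gating classification behind the anomaly check keeps the more expensive text analysis off the normal path; in a design like the one the abstract describes, the returned fault class would then drive the choice of a recovery action.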
Taxonomy
Topics: Software System Performance and Reliability · Big Data and Digital Economy · Cloud Computing and Resource Management
