Cloud-Based AI Systems: Leveraging Large Language Models for Intelligent Fault Detection and Autonomous Self-Healing

Cheng Ji; Huaiying Luo

arXiv:2505.11743·cs.DC·May 20, 2025

Open Access

TL;DR

This paper introduces a novel AI framework utilizing Large Language Models for real-time fault detection and autonomous self-healing in cloud systems, improving accuracy and reducing downtime compared to traditional methods.

Contribution

It presents a new multi-level AI architecture combining LLMs with machine learning for proactive fault detection and self-healing in complex cloud environments.

Findings

01 Enhanced fault detection accuracy

02 Reduced system downtime

03 Faster recovery times

Abstract

With the rapid development of cloud computing systems and the increasing complexity of their infrastructure, intelligent mechanisms to detect and mitigate failures in real time are becoming increasingly important. Traditional failure detection methods often struggle to cope with the scale and dynamics of modern cloud environments. In this study, we propose a novel AI framework based on a Large Language Model (LLM) for intelligent fault detection and self-healing in cloud systems. The model combines existing machine learning fault detection algorithms with the LLM's natural language understanding capabilities to process and parse system logs, error reports, and real-time data streams using semantic context. The method adopts a multi-level architecture that combines supervised learning for fault classification with unsupervised learning for anomaly detection, so that the…
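
The multi-level idea described in the abstract (unsupervised anomaly detection, supervised fault classification over log text, then a self-healing step) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract here is truncated and names no specific models, so the sketch assumes a median-absolute-deviation outlier test for the unsupervised level and a small naive Bayes text classifier as a stand-in for the LLM-based log understanding. All function names, fault labels, and remediation actions below are hypothetical.

```python
import math
import statistics
from collections import Counter, defaultdict


def robust_anomalies(values, threshold=3.5):
    """Unsupervised level (assumed technique): flag indices whose modified
    z-score, based on the median absolute deviation (MAD), exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread in the data, nothing can be flagged
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - median) / mad > threshold]


class NaiveBayesLogClassifier:
    """Supervised level: multinomial naive Bayes over whitespace tokens.
    A stand-in for the LLM's log understanding, not the paper's model."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.class_counts = Counter()             # label -> number of examples
        self.vocab = set()

    def fit(self, lines, labels):
        for line, label in zip(lines, labels):
            tokens = line.lower().split()
            self.token_counts[label].update(tokens)
            self.class_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, line):
        tokens = line.lower().split()
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
            score = math.log(n / total)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical mapping from fault class to a self-healing action.
REMEDIATIONS = {
    "disk": "rotate_logs_and_expand_volume",
    "network": "restart_network_proxy",
    "memory": "restart_service_with_higher_limit",
}


def plan_remediation(metric_values, log_line, classifier):
    """Return a remediation action only when the metric stream looks anomalous."""
    if not robust_anomalies(metric_values):
        return None
    return REMEDIATIONS.get(classifier.predict(log_line))


# Tiny illustrative training set (made up for this sketch).
TRAIN_LINES = [
    "disk usage exceeded 95 percent on volume",
    "no space left on device",
    "connection timed out to upstream service",
    "connection refused by upstream host",
    "out of memory killing process",
    "memory allocation failed oom",
]
TRAIN_LABELS = ["disk", "disk", "network", "network", "memory", "memory"]

classifier = NaiveBayesLogClassifier().fit(TRAIN_LINES, TRAIN_LABELS)
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 500, 101]
action = plan_remediation(latencies_ms, "upstream connection timed out", classifier)
```

In this sketch the anomaly test acts as a gate: the classifier and the remediation lookup run only when the metric stream contains an outlier, which keeps the more expensive log-analysis step off the normal path. A real deployment would replace the naive Bayes stand-in with whatever language model the system uses and would validate any remediation before applying it.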

Taxonomy

Topics: Software System Performance and Reliability · Big Data and Digital Economy · Cloud Computing and Resource Management