An Intelligent Fault Self-Healing Mechanism for Cloud AI Systems via Integration of Large Language Models and Deep Reinforcement Learning
Ze Yang, Yihong Jin, Juntian Liu, Xinhe Xu

TL;DR
This paper presents an innovative fault self-healing mechanism for cloud AI systems that combines large language models and deep reinforcement learning to improve fault detection, semantic understanding, and adaptive recovery.
Contribution
It introduces a hybrid architecture integrating LLMs and DRL for fault interpretation and recovery, enhancing efficiency and adaptability in cloud AI system fault management.
Findings
Shortened recovery time by 37% in experiments
Improved fault identification accuracy with LLM integration
Enhanced adaptation to new failure modes
Abstract
As cloud-based AI systems grow in scale and complexity, detecting system faults and recovering from them adaptively have become core challenges for ensuring service reliability and continuity. In this paper, we propose an Intelligent Fault Self-Healing Mechanism (IFSHM) that integrates a Large Language Model (LLM) with Deep Reinforcement Learning (DRL) to realize a fault recovery framework with semantic understanding and policy optimization capabilities for cloud AI systems. Building on the traditional DRL-based control model, the proposed method constructs a two-stage hybrid architecture: (1) an LLM-driven fault semantic interpretation module, which dynamically extracts deep contextual semantics from multi-source logs and system indicators to accurately identify potential fault modes; (2) a DRL recovery strategy optimizer, based on reinforcement learning, learns…
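The two-stage flow described in the abstract (interpret the fault, then choose a recovery action) can be illustrated with a toy sketch. This is not the authors' implementation: a keyword lookup stands in for the LLM interpretation module, a tabular epsilon-greedy learner stands in for the deep RL optimizer, and every fault label, action name, log line, and reward value below is hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical fault labels and recovery actions; none of these come from the paper.
FAULT_KEYWORDS = {
    "oom": "memory_exhaustion",
    "timeout": "network_timeout",
    "disk": "disk_pressure",
}
ACTIONS = ["restart_service", "scale_out", "clear_cache"]

# Toy ground truth, used only to generate rewards in this simulation.
BEST_ACTION = {
    "memory_exhaustion": "restart_service",
    "network_timeout": "scale_out",
    "disk_pressure": "clear_cache",
}


def interpret_fault(log_line: str) -> str:
    """Stage 1 stand-in: keyword lookup where the paper uses an LLM."""
    lowered = log_line.lower()
    for keyword, label in FAULT_KEYWORDS.items():
        if keyword in lowered:
            return label
    return "unknown"


class RecoveryAgent:
    """Stage 2 stand-in: tabular epsilon-greedy learner instead of deep RL."""

    def __init__(self, epsilon=0.1, lr=0.5, seed=0):
        self.q = defaultdict(float)  # (fault, action) -> estimated reward
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def choose(self, fault, explore=True):
        # Explore with probability epsilon, otherwise act greedily.
        if explore and self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(fault, a)])

    def update(self, fault, action, reward):
        # Each recovery attempt is treated as a single-step episode.
        key = (fault, action)
        self.q[key] += self.lr * (reward - self.q[key])


def train(agent, logs, episodes=500):
    for _ in range(episodes):
        fault = interpret_fault(agent.rng.choice(logs))
        action = agent.choose(fault)
        reward = 1.0 if BEST_ACTION.get(fault) == action else -1.0
        agent.update(fault, action, reward)


logs = [
    "worker-3 killed: OOM while loading model shard",
    "gateway: upstream timeout after 30s",
    "node-7: disk usage above 95%",
]
agent = RecoveryAgent()
train(agent, logs)
policy = {
    interpret_fault(line): agent.choose(interpret_fault(line), explore=False)
    for line in logs
}
```

After training, `policy` maps each interpreted fault label to the action the agent currently rates highest. In a real system the reward would presumably come from observed recovery outcomes rather than a fixed lookup table, and the state would carry far more context than a single label.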
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Software-Defined Networks and 5G
