Enhancing Cluster Resilience: LLM-agent Based Autonomous Intelligent Cluster Diagnosis System and Evaluation Framework
Honghao Shi, Longkai Cheng, Wenli Wu, Yuhang Wang, Xuan Liu, Shaokai Nie, Weixv Wang, Xuebin Min, Chunlei Men, Yonghua Lin

TL;DR
This paper presents an autonomous LLM-agent system that combines Retrieval-Augmented Generation (RAG), Diagram of Thought (DoT), and self-play to diagnose and resolve issues in AI clusters. The authors report superior performance over traditional methods.
Contribution
The paper introduces an LLM-agent system built on a knowledge base tailored for cluster diagnostics, enhanced LLM algorithms, practical deployment strategies for agents, and a benchmark designed to evaluate LLM capabilities in cluster diagnostics.
Findings
Demonstrated improved accuracy in diagnosing cluster issues
Achieved higher efficiency in troubleshooting processes
Validated system effectiveness through extensive experiments
Abstract
Recent advancements in Large Language Models (LLMs) and related technologies such as Retrieval-Augmented Generation (RAG) and Diagram of Thought (DoT) have enabled the creation of autonomous intelligent systems capable of performing cluster diagnostics and troubleshooting. By integrating these technologies with self-play methodologies, we have developed an LLM-agent system designed to autonomously diagnose and resolve issues within AI clusters. Our innovations include a knowledge base tailored for cluster diagnostics, enhanced LLM algorithms, practical deployment strategies for agents, and a benchmark specifically designed for evaluating LLM capabilities in this domain. Through extensive experimentation across multiple dimensions, we have demonstrated the superiority of our system in addressing the challenges faced in cluster diagnostics, particularly in detecting and rectifying…
Taxonomy
Topics: Artificial Intelligence in Healthcare
Methods: Balanced Selection
