TSGuard: Automated User-Centric Incident Diagnosis for AI Workloads in the Cloud
Yitao Yang, Yangtao Deng, Yifan Xiong, Baochun Li, Hong Xu, Peng Cheng

TL;DR
TSGuard is a user-centric system that provides immediate, accurate incident diagnosis for AI workloads in the cloud by leveraging historical data and structured reasoning, significantly reducing diagnosis time.
Contribution
TSGuard introduces a multi-agent system that builds knowledge bases from past incidents and mimics human expert diagnosis, improving both speed and accuracy.
Findings
Diagnostic accuracy improved by 19.8% over baselines.
Verification time reduced by 63.4%.
Effective on real incident records from Microsoft Azure.
Abstract
AI workloads incur frequent failures and incidents from the underlying infrastructure. The current incident management workflow follows a provider-centric paradigm, where users report incidents to the infrastructure provider who then conducts troubleshooting. Due to the large number of incidents and the manual nature of the troubleshooting process, the provider often takes several days to resolve an incident, resulting in operational delays and productivity loss. To address these challenges, we present TSGuard, a user-centric multi-agent system that delivers immediate incident diagnosis to users who deploy the workloads. The core innovation of TSGuard is twofold: (1) constructing domain-specific knowledge bases by mining historical on-call experiences in the offline phase, and (2) mimicking human expert diagnosis via structured reasoning and iterative trial-and-error in the online…
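The two phases named in the abstract can be illustrated with a toy sketch. It is not TSGuard's implementation: the paper describes a multi-agent system, while this example swaps in a plain symptom-to-cause co-occurrence count for the offline knowledge base and a ranked verify loop for the online trial-and-error. Every name here (`build_knowledge_base`, `rank_hypotheses`, `diagnose`, the `verify` callback, the sample incidents) is hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical record shape: (observed symptoms, confirmed root cause).
Incident = Tuple[Iterable[str], str]


def build_knowledge_base(history: Iterable[Incident]) -> Dict[str, Counter]:
    """Offline phase (sketch): count symptom/root-cause co-occurrences."""
    kb: Dict[str, Counter] = defaultdict(Counter)
    for symptoms, cause in history:
        for symptom in set(symptoms):
            kb[symptom][cause] += 1
    return dict(kb)


def rank_hypotheses(kb: Dict[str, Counter], symptoms: Iterable[str]) -> List[str]:
    """Rank candidate root causes by how often they matched these symptoms."""
    scores: Counter = Counter()
    for symptom in set(symptoms):
        scores.update(kb.get(symptom, Counter()))
    # Highest score first; ties broken by name so the order is deterministic.
    return [c for c, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


def diagnose(
    kb: Dict[str, Counter],
    symptoms: Iterable[str],
    verify: Callable[[str], bool],
    max_trials: int = 3,
) -> Optional[str]:
    """Online phase (sketch): try hypotheses in rank order until one verifies."""
    for cause in rank_hypotheses(kb, symptoms)[:max_trials]:
        if verify(cause):  # assumed user-side check, e.g. a diagnostic script
            return cause
    return None  # nothing confirmed within the trial budget


# Made-up incident history, for illustration only.
history = [
    (["nccl timeout", "gpu xid error"], "faulty gpu"),
    (["nccl timeout"], "network link flap"),
    (["nccl timeout", "ib port down"], "network link flap"),
    (["oom"], "host memory pressure"),
]
kb = build_knowledge_base(history)
result = diagnose(kb, ["nccl timeout"], verify=lambda c: c == "faulty gpu")
```

In this run the top-ranked hypothesis ("network link flap") fails verification, so the loop falls through to the second candidate and `result` is "faulty gpu". The `max_trials` budget is an assumption added to bound the loop, not a parameter taken from the paper.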
