Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models
Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, Saravan Rajmohan

TL;DR
This paper evaluates how effectively large language models such as GPT-3.x can assist engineers with root-cause analysis and mitigation of cloud incidents, and reports promising results from a large-scale empirical study.
Contribution
First large-scale study applying GPT-3.x models to incident management, comparing zero-shot, fine-tuned, and multi-task settings and validating the results with human evaluation at Microsoft.
Findings
Models outperform baseline in incident diagnosis
Human evaluation confirms model usefulness
Potential for AI-assisted incident resolution
Abstract
Incident management for cloud services is a complex process involving several steps and has a huge impact on both service health and developer productivity. On-call engineers require a significant amount of domain knowledge and manual effort for root causing and mitigation of production incidents. Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents. We do a rigorous study at Microsoft on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned and multi-task settings using semantic and lexical metrics.…
Taxonomy
Topics: Software System Performance and Reliability · Cloud Data Security Solutions · Data Quality and Management
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Residual Connection · Dense Connections · Layer Normalization · Attention Dropout · Weight Decay
