Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps
Amrita Saha, Steven C.H. Hoi

TL;DR
This paper introduces a neural NLP-based approach to extract structured root cause knowledge from unstructured incident investigation documents, enabling effective retrieval-based root cause analysis for cloud services.
Contribution
It presents a novel Incident Causation Analysis (ICA) engine that converts unstructured Problem Review Board (PRB) data into a structured causal knowledge graph for improved RCA.
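To make the idea of a structured causal knowledge graph with retrieval-based RCA concrete, here is a minimal sketch. It is not the authors' ICA engine: the paper uses neural NLP extraction, whereas this toy swaps in hand-written records and plain token-overlap (Jaccard) matching. All names (`RECORDS`, `build_graph`, `retrieve_root_causes`) and the sample incidents are hypothetical.

```python
from collections import defaultdict

# Hypothetical records standing in for knowledge extracted from PRB documents.
# In the paper this extraction is done by neural NLP models; here it is hand-written.
RECORDS = [
    {"symptom": "login requests time out", "root_cause": "connection pool exhausted"},
    {"symptom": "search latency spike", "root_cause": "index rebuild after deploy"},
    {"symptom": "login page returns errors", "root_cause": "expired certificate"},
]


def tokenize(text):
    """Lowercase whitespace tokenization; a stand-in for a real text encoder."""
    return set(text.lower().split())


def build_graph(records):
    """Map each symptom node to the set of root-cause nodes it points to."""
    graph = defaultdict(set)
    for record in records:
        graph[record["symptom"]].add(record["root_cause"])
    return graph


def retrieve_root_causes(graph, query, top_k=2):
    """Rank known symptoms by Jaccard overlap with the query and return their causes."""
    query_tokens = tokenize(query)
    scored = []
    for symptom, causes in graph.items():
        symptom_tokens = tokenize(symptom)
        union = query_tokens | symptom_tokens
        score = len(query_tokens & symptom_tokens) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, symptom, sorted(causes)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


graph = build_graph(RECORDS)
results = retrieve_root_causes(graph, "login requests time out for users")
for score, symptom, causes in results:
    print(f"{score:.2f}  {symptom} -> {causes}")
```

A new incident description is matched against past symptoms, and the linked root causes are returned as candidates for the investigator to check. A realistic system would replace the token overlap with learned text representations.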
Findings
ICA achieves high accuracy in extracting causal information.
The retrieval-based RCA pipeline effectively identifies root causes.
Domain experts validate the system's practical utility.
Abstract
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes, especially for cloud industry leaders like Salesforce. Typically, RCA investigation leverages data sources like application error logs or service call traces. However, a rich goldmine of root cause information is also hidden in the natural language documentation of past incident investigations by domain experts. This is generally termed Problem Review Board (PRB) data, which constitutes a core component of IT Incident Management. However, owing to the raw unstructured nature of PRBs, such root cause knowledge is not directly reusable by manual or automated pipelines for RCA of new incidents. This motivates us to leverage this widely available data source to build an Incident Causation Analysis (ICA) engine, using SoTA neural NLP techniques to extract…
