TL;DR
This paper introduces CIRCA, an unsupervised causal inference method for root cause analysis in online service systems, effectively identifying key fault indicators to improve failure mitigation.
Contribution
The paper formulates root cause analysis as intervention recognition and proposes CIRCA, a novel causal inference approach leveraging system architecture knowledge for online fault diagnosis.
Findings
CIRCA improves the recall of the top-1 recommendation by 25% over the best baseline method.
Simulation confirms CIRCA's theoretical reliability.
Constructing causal graphs from system architecture knowledge enhances the analysis.
Abstract
Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic losses. In the field of online service systems, operators rely on enormous amounts of monitoring data to detect and mitigate failures. Quickly recognizing a small set of root cause indicators for the underlying fault can save much time for failure mitigation. In this paper, we formulate the root cause analysis problem as a new causal inference task named intervention recognition. We propose a novel unsupervised causal inference-based method named Causal Inference-based Root Cause Analysis (CIRCA). The core idea is a sufficient condition for a monitoring variable to be a root cause indicator, i.e., the change of probability distribution conditioned on the parents in the Causal Bayesian Network (CBN). Towards the application in online service systems, CIRCA constructs a graph among monitoring metrics based…
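The core idea in the abstract, that a root cause indicator is a variable whose distribution conditioned on its parents changes, can be illustrated with a toy sketch. This is not the paper's implementation: the linear regression, the residual z-score, the function name `score_variables`, and the three-metric chain A -> B -> C are all assumptions made here for illustration.

```python
import numpy as np

def _design(data, parent_names, n):
    # Intercept column plus one column per parent (intercept only for root nodes).
    return np.column_stack([np.ones(n)] + [data[p] for p in parent_names])

def score_variables(parents, normal, faulty):
    """Rank metrics by how strongly P(V | parents(V)) appears to shift.

    parents: dict mapping metric name -> list of parent metric names (hypothetical graph)
    normal, faulty: dicts mapping metric name -> 1-D numpy array
    """
    scores = {}
    for var, pa in parents.items():
        x_normal = _design(normal, pa, len(normal[var]))
        x_faulty = _design(faulty, pa, len(faulty[var]))
        # Assumption: model the conditional mean with least squares on fault-free data.
        coef, *_ = np.linalg.lstsq(x_normal, normal[var], rcond=None)
        sigma = (normal[var] - x_normal @ coef).std() + 1e-9
        # Score: largest fault-period residual, in units of fault-free residual spread.
        resid_faulty = faulty[var] - x_faulty @ coef
        scores[var] = float(np.max(np.abs(resid_faulty)) / sigma)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: the fault is an intervention on B (its relation to A shifts by +5).
rng = np.random.default_rng(0)
parents = {"A": [], "B": ["A"], "C": ["B"]}

a = rng.normal(size=500)
b = 2 * a + rng.normal(scale=0.1, size=500)
c = -b + rng.normal(scale=0.1, size=500)
normal = {"A": a, "B": b, "C": c}

a_f = rng.normal(size=50)
b_f = 2 * a_f + 5 + rng.normal(scale=0.1, size=50)
c_f = -b_f + rng.normal(scale=0.1, size=50)  # C still follows its usual relation to B
faulty = {"A": a_f, "B": b_f, "C": c_f}

ranking = score_variables(parents, normal, faulty)
print(ranking)
```

In this toy setup B should rank first. C also moves during the fault, but its relation to its parent B is unchanged, so its conditional score stays low. Scoring C without conditioning on B would flag it as well, which is the confusion the conditional criterion is meant to avoid.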
