
TL;DR
This paper introduces PRISM, a fast root cause analysis method for complex systems that lack dependency graphs. It reaches 68% Top-1 accuracy, a 258% improvement over the best baseline, at 8 ms per diagnosis.
Contribution
PRISM is a novel framework that performs accurate root cause analysis without dependency graphs, with theoretical guarantees and proven effectiveness on real-world data.
Findings
PRISM achieves 68% Top-1 accuracy on 735 failures across 9 real-world datasets.
PRISM improves Top-1 accuracy by 258% over the best baseline.
PRISM diagnoses in only 8 milliseconds per failure.
Abstract
Failures in complex systems demand rapid Root Cause Analysis (RCA) to prevent cascading damage. Existing RCA methods that operate without a dependency graph typically assume that the root cause has the highest anomaly score. This assumption fails when faults propagate, as a small delay at the root cause can accumulate into a much larger anomaly downstream. In this paper, we propose PRISM, a simple and efficient framework for RCA when the dependency graph is absent. We formulate a class of component-based systems under which PRISM performs RCA with theoretical guarantees. On 735 failures across 9 real-world datasets, PRISM achieves 68% Top-1 accuracy, a 258% improvement over the best baseline, while requiring only 8 ms per diagnosis.
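The failure mode described in the abstract can be illustrated with a toy sketch. Everything here is hypothetical and not taken from the paper: the three-component chain (db, api, frontend), the latency values, and the z-score scoring are assumptions chosen only for illustration. This is not PRISM's algorithm; it only shows how ranking by the highest anomaly score can point away from the true root cause when a small delay grows as it propagates.

```python
from statistics import mean, stdev

# Hypothetical latency samples (ms) under normal operation, per component.
NORMAL = {
    "db": [10.0, 10.2, 9.8, 10.1, 9.9],
    "api": [20.0, 20.5, 19.5, 20.2, 19.8],
    "frontend": [40.0, 41.0, 39.0, 40.5, 39.5],
}

# Hypothetical latencies (ms) during a failure. The assumed root cause is
# "db", which slows by only 2 ms; the delay grows in the downstream
# components (for example through retries and queueing).
FAILURE = {"db": 12.0, "api": 30.0, "frontend": 80.0}


def anomaly_score(samples, observed):
    """Z-score of the observed value against the normal samples."""
    return abs(observed - mean(samples)) / stdev(samples)


def rank_by_anomaly_score(normal, failure):
    """Return component names sorted from most to least anomalous."""
    scores = {name: anomaly_score(normal[name], failure[name]) for name in normal}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    ranking = rank_by_anomaly_score(NORMAL, FAILURE)
    # The assumed root cause ("db") is ranked last, not first.
    print(ranking)
```

In this toy setup the component furthest downstream receives the highest score, so a method that reports the top-scoring component would miss the assumed root cause.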
Taxonomy
Topics: Software System Performance and Reliability · Distributed systems and fault tolerance · Anomaly Detection Techniques and Applications
