Robust Root Cause Diagnosis using In-Distribution Interventions

Lokesh Nagalapatti; Ashutosh Srivastava; Sunita Sarawagi; Amit Sharma

arXiv:2505.00930·cs.LG·May 5, 2025

TL;DR

This paper introduces In-Distribution Interventions (IDI), a new root cause diagnosis algorithm that uses in-distribution interventional estimates to improve accuracy and robustness over traditional counterfactual-based methods, especially in rare anomaly scenarios.

Contribution

The paper proposes IDI, a novel root cause diagnosis method that relies on in-distribution interventions instead of counterfactuals, addressing the unreliability of counterfactual estimates when rare anomalies fall outside the training distribution.

Findings

01 · IDI outperforms nine state-of-the-art baselines in accuracy and robustness.

02 · Theoretical bounds compare interventional and counterfactual estimation errors.

03 · Experimental results show IDI's effectiveness on synthetic and real datasets.

Abstract

Diagnosing the root cause of an anomaly in a complex interconnected system is a pressing problem in today's cloud services and industrial operations. We propose In-Distribution Interventions (IDI), a novel algorithm that predicts root cause as nodes that meet two criteria: 1) **Anomaly:** root cause nodes should take on anomalous values; 2) **Fix:** had the root cause nodes assumed usual values, the target node would not have been anomalous. Prior methods of assessing the fix condition rely on counterfactuals inferred from a Structural Causal Model (SCM) trained on historical data. But since anomalies are rare and fall outside the training distribution, the fitted SCMs yield unreliable counterfactual estimates. IDI overcomes this by relying on interventional estimates obtained by solely probing the fitted SCM at in-distribution inputs. We present a theoretical analysis comparing and…
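The two criteria in the abstract can be illustrated with a minimal sketch on a toy two-node system X → Y. Everything below is an assumption made for illustration and is not taken from the paper: the additive-noise mechanism, the linear regressor, the z-score anomaly test with threshold 3, and all names (`true_mechanism`, `f_hat`, `is_anomalous`). The sketch contrasts a counterfactual estimate, which must evaluate the fitted mechanism at the out-of-distribution anomalous input to recover the noise, with an interventional estimate that only probes the fitted mechanism at a usual input and reuses held-out residuals as noise samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_mechanism(x):
    # Toy assumption: linear on the usual range of X, nonlinear far outside it.
    return x + 0.5 * np.maximum(0.0, x - 2.0) ** 2

# Historical, non-anomalous data for X -> Y with additive noise.
x_all = rng.uniform(-1.0, 1.0, size=2000)
y_all = true_mechanism(x_all) + rng.normal(0.0, 0.1, size=2000)
x_train, y_train = x_all[:1500], y_all[:1500]
x_val, y_val = x_all[1500:], y_all[1500:]

# Fitted mechanism: accurate in-distribution, wrong far outside it.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def f_hat(x):
    return slope * x + intercept

# Held-out residuals serve as in-distribution noise samples.
val_residuals = y_val - f_hat(x_val)

def is_anomalous(value, reference, threshold=3.0):
    # Hypothetical anomaly test: z-score against historical values.
    z = abs(value - reference.mean()) / reference.std()
    return bool(z > threshold)

# One anomalous observation: X is far outside its usual range.
x_obs = 6.0
y_obs = true_mechanism(x_obs) + 0.05
x_fix = x_train.mean()  # a "usual" value for X

# Criterion 1 (Anomaly): the candidate root cause takes an anomalous value.
anomaly_condition = is_anomalous(x_obs, x_train)

# Criterion 2 (Fix), counterfactual style: abduct the noise at the
# out-of-distribution input, then replay it at the usual input.
eps_hat = y_obs - f_hat(x_obs)
y_counterfactual = f_hat(x_fix) + eps_hat
fix_counterfactual = not is_anomalous(y_counterfactual, y_train)

# Criterion 2 (Fix), interventional style: probe f_hat only at the usual
# input and average over held-out noise samples.
y_interventional = float(np.mean(f_hat(x_fix) + val_residuals))
fix_interventional = not is_anomalous(y_interventional, y_train)

print(anomaly_condition, fix_counterfactual, fix_interventional)
```

In this toy setting the abducted noise absorbs the extrapolation error of `f_hat` at `x_obs`, so the counterfactual estimate still looks anomalous and X would be rejected as a root cause, while the interventional estimate stays near the usual range of Y. The actual IDI algorithm operates on multi-node causal graphs and may differ in how it scores anomalies and aggregates samples.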

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. The presence of hidden confounding factors makes the OOD problem both prevalent and significant.
2. The writing of this paper is clear and easy to read.
3. The paper provides proof that the IDI method's in-distribution sampling has a bounded error, scaling with the distance between anomalies and normality.

Weaknesses

See the following questions.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The three strengths of the paper are:
1. Whatever is presented in the paper is technically sound, although I believe some clarifications are needed and the contribution is not enough (i.e., more content can be added).
2. **Main Strength:** The experiment section is very impressive. It is thorough and the ablation studies are performed well.
3. The main research question addressed in this paper, i.e., counterfactual estimates can push SCMs to OOD regions, and sampling from in-distribution to in…

Weaknesses

The methodological contribution of the paper is a major weakness. There are some weaknesses in the experiment section as well.
1. Sampling the latent exogenous variable from the (validation set) distribution as opposed to inverting the function $f_i$ (abduction) is important but a minor methodological tweak that might not warrant a full research paper. It is also unclear if the "in-distribution intervention" approach is novel or has been proposed in the literature. The authors must show an analogy…

Reviewer 03 · Rating 6 · Confidence 2

Strengths

- The paper provides a solid theoretical foundation for IDI, including error bounds and conditions where IDI surpasses counterfactual methods.
- Experimental results consistently show that IDI outperforms baselines, highlighting its effectiveness and robustness in root cause diagnosis.

Weaknesses

- The presentation is complex and may be challenging to follow, particularly in Section 3, due to the heavy use of notation. Adding examples to clarify key concepts would improve readability.
- While IDI demonstrates strong performance on synthetic and benchmark datasets, further validation in diverse, real-world industrial systems would strengthen its practical applicability.
- The focus is on accuracy, but runtime and latency evaluations are limited. A discussion of computational overhead an…

Videos

Robust Root Cause Diagnosis using In-Distribution Interventions · SlidesLive

Taxonomy

Topics: Infrastructure Maintenance and Monitoring · Fault Detection and Control Systems · Mineral Processing and Grinding