# CausIL: Causal Graph for Instance Level Microservice Data

**Authors:** Sarthak Chakraborty, Shaddy Garg, Shubham Agarwal, Ayush Chauhan, and Shiv Kumar Saini

arXiv: 2303.00554 · 2023-03-21

## TL;DR

CausIL is a causal structure detection method for cloud microservices that accounts for dynamically scaled service instances and system architecture, improving graph estimation accuracy by ~25% over baselines in simulation.

## Contribution

CausIL is a causal structure detection approach that incorporates domain knowledge derived from the system architecture and models compute distributed across a varying number of microservice instances.
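To make the instance-level idea concrete, the sketch below contrasts two ways of laying out metric data before any causal estimation: averaging across instances at each timestamp, versus keeping one row per instance per timestamp. This is only an illustration of the data layout, not the paper's estimator. The metric names (`cpu`, `latency`), the instance ids, the sample values, and both helper functions are hypothetical.

```python
from statistics import mean

# Hypothetical samples: timestamp -> {instance_id: (cpu, latency_ms)}.
# The instance count changes between timestamps, as it would under an auto-scaler.
samples = {
    0: {"pod-a": (0.20, 110.0), "pod-b": (0.80, 190.0)},
    1: {"pod-a": (0.30, 120.0), "pod-b": (0.70, 180.0), "pod-c": (0.50, 150.0)},
}

def aggregate_rows(samples):
    """One row per timestamp: metrics averaged over all live instances."""
    rows = []
    for t in sorted(samples):
        values = list(samples[t].values())
        rows.append((t, mean(v[0] for v in values), mean(v[1] for v in values)))
    return rows

def instance_rows(samples):
    """One row per (timestamp, instance): instance-specific variation is kept."""
    rows = []
    for t in sorted(samples):
        for instance_id, (cpu, latency) in sorted(samples[t].items()):
            rows.append((t, instance_id, cpu, latency))
    return rows

print(len(aggregate_rows(samples)), "aggregated rows")
print(len(instance_rows(samples)), "instance-level rows")
```

In this toy data, averaging hides the spread between `pod-a` and `pod-b`, while the instance-level layout keeps it and naturally accommodates a changing number of instances.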

## Key findings

- Improves causal graph estimation accuracy over baselines by ~25% in simulations, as measured by Structural Hamming Distance.
- Demonstrates effectiveness on real-world cloud system data.
- Handles dynamic instance counts and load balancing in causal modeling.

## Abstract

AI-based monitoring has become crucial for cloud-based services due to their scale. A common approach to AI-based monitoring is to detect causal relationships among service components and build a causal graph. Availability of domain information makes cloud systems even better suited for such causal detection approaches. In modern cloud systems, however, auto-scalers dynamically change the number of microservice instances, and a load-balancer manages the load on each instance. This poses a challenge for off-the-shelf causal structure detection techniques, as they neither incorporate the system architectural domain information nor provide a way to model distributed compute across varying numbers of service instances. To address this, we develop CausIL, which detects a causal structure among service metrics by considering compute distributed across dynamic instances and incorporating domain knowledge derived from system architecture. Towards the application in cloud systems, CausIL estimates a causal graph using instance-specific variations in performance metrics, modeling multiple instances of a service as independent, conditional on system assumptions. A simulation study shows the efficacy of CausIL over baselines, improving graph estimation accuracy by ~25% as measured by Structural Hamming Distance, whereas the real-world dataset demonstrates CausIL's applicability in deployment settings.
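For readers unfamiliar with the metric named in the abstract, here is a minimal sketch of Structural Hamming Distance between two directed graphs given as edge sets. It assumes one common convention, in which a missing edge, an extra edge, and a reversed edge each count as one error; other conventions count a reversal as two, and the summary does not say which the paper uses. The function name, node names, and example graphs are hypothetical.

```python
def structural_hamming_distance(true_edges, est_edges):
    """Count node pairs whose edge status differs between two directed graphs.

    Assumed convention: a missing, extra, or reversed edge each adds 1.
    """
    true_edges = set(true_edges)
    est_edges = set(est_edges)
    distance = 0
    seen = set()
    for u, v in true_edges | est_edges:
        pair = frozenset((u, v))
        if pair in seen:
            continue  # each unordered node pair is scored once
        seen.add(pair)
        # Edge status of the pair in each graph: (u -> v present, v -> u present).
        true_state = ((u, v) in true_edges, (v, u) in true_edges)
        est_state = ((u, v) in est_edges, (v, u) in est_edges)
        if true_state != est_state:
            distance += 1
    return distance

# Hypothetical ground-truth and estimated graphs over service metrics.
true_graph = {("load", "cpu"), ("cpu", "latency"), ("latency", "errors")}
est_graph = {("load", "cpu"), ("latency", "cpu"), ("load", "errors")}

# One reversed edge, one missing edge, one extra edge.
print(structural_hamming_distance(true_graph, est_graph))  # prints 3
```

A lower value means the estimated graph is structurally closer to the ground truth, so an accuracy improvement under this metric corresponds to a smaller distance.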

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2303.00554/full.md

## Figures

42 figures with captions in the complete paper: https://tomesphere.com/paper/2303.00554/full.md

## References

49 references — full list in the complete paper: https://tomesphere.com/paper/2303.00554/full.md

---
Source: https://tomesphere.com/paper/2303.00554