Deoxys: A Causal Inference Engine for Unhealthy Node Mitigation in Large-scale Cloud Infrastructure
Chaoyun Zhang, Randolph Yao, Si Qin, Ze Li, Shekhar Agrawal, Binit R. Mishra, Tri Tran, Minghua Ma, Qingwei Lin, Murali Chintalapati, Dongmei Zhang

TL;DR
Deoxys is a causal inference engine that recommends mitigation actions for unhealthy nodes in large-scale cloud infrastructure, significantly reducing VM downtime and interruptions by leveraging observational data and advanced causal modeling techniques.
Contribution
The paper introduces Deoxys, a novel causal inference engine using double machine learning and causal forests for effective unhealthy node mitigation in cloud systems.
Findings
Deoxys reduces VM downtime by 53% compared to legacy policies.
Deoxys lowers VM interruption rate by 49.5%.
Deployment in Microsoft cloud infrastructure demonstrates substantial reliability improvements.
Abstract
The presence of unhealthy nodes in cloud infrastructure signals the potential failure of machines, which can significantly impact the availability and reliability of cloud services, resulting in negative customer experiences. Effective unhealthy node mitigation is therefore vital for sustaining cloud system performance. This paper introduces Deoxys, a causal inference engine tailored to recommending mitigation actions for unhealthy nodes in cloud systems to minimize virtual machine downtime and interruptions during unhealthy events. It employs double machine learning combined with causal forest to produce precise and reliable mitigation recommendations based solely on limited observational data collected from historical unhealthy events. To enhance the causal inference model, Deoxys further incorporates a policy fallback mechanism based on model uncertainty and action…
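To make the double machine learning idea in the abstract concrete, here is a minimal sketch of the generic cross-fitted "partialling-out" estimator on synthetic data. It is not the Deoxys implementation: the variable names (`x` as a node health signal, `t` as a mitigation intensity, `y` as VM downtime), the linear nuisance models, the continuous treatment, and the numbers are all assumptions made for illustration, and the causal forest stage that would capture effects varying across nodes is omitted.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((v - mx) ** 2 for v in xs)
    sxy = sum((v - mx) * (w - my) for v, w in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def dml_effect(x, t, y, n_folds=2):
    """Cross-fitted partialling-out estimate of the effect of t on y."""
    n = len(x)
    res_t, res_y = [0.0] * n, [0.0] * n
    for k in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == k]
        train = [i for i in range(n) if i % n_folds != k]
        # Nuisance models are fitted on the other fold(s) only.
        a_t, b_t = fit_line([x[i] for i in train], [t[i] for i in train])
        a_y, b_y = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            res_t[i] = t[i] - (a_t + b_t * x[i])  # treatment not explained by x
            res_y[i] = y[i] - (a_y + b_y * x[i])  # outcome not explained by x
    # Regress outcome residuals on treatment residuals.
    return sum(a * b for a, b in zip(res_t, res_y)) / sum(a * a for a in res_t)

# Hypothetical observational data: x confounds both the action taken and the outcome.
rng = random.Random(0)
TRUE_EFFECT = -2.0  # assumed effect of the action on downtime
x = [rng.gauss(0, 1) for _ in range(4000)]
t = [0.8 * xi + rng.gauss(0, 1) for xi in x]
y = [TRUE_EFFECT * ti + 3.0 * xi + rng.gauss(0, 1) for ti, xi in zip(t, x)]

naive = fit_line(t, y)[1]    # ignores the confounder, so it is biased
theta = dml_effect(x, t, y)  # adjusts for the confounder via residuals
print(f"naive slope: {naive:.2f}, DML estimate: {theta:.2f}")
```

In this toy setup the naive regression of downtime on the action mixes the action's effect with the confounder's influence, while the residual-on-residual step recovers an estimate close to the assumed effect. A production system would typically replace the linear nuisance models with flexible learners.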
Taxonomy
Topics: Cloud Data Security Solutions
Methods: Causal inference
