Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
Zhiqiang Xie, Yujia Zheng, Lizi Ottens, Kun Zhang, Christos Kozyrakis, Jonathan Mace

TL;DR
Atlas uses large language models to automatically generate causal graphs for cloud systems, improving fault localization efficiency and accuracy without extensive manual effort or reliance on incident data.
Contribution
The paper introduces Atlas, a novel LLM-based method for automatically synthesizing causal graphs in cloud systems, enhancing fault localization processes.
Findings
Atlas outperforms data-driven causal discovery methods in fault localization tasks.
Atlas generates scalable and generalizable causal graphs.
Fault localization with Atlas-generated causal graphs performs comparably to localization with ground-truth causal graphs.
Abstract
Runtime failure and performance degradation are commonplace in modern cloud systems. For cloud providers, automatically determining the root cause of incidents is paramount to ensuring high reliability and availability, as prompt fault localization can enable faster diagnosis and triage for timely resolution. A compelling solution explored in recent work is causal reasoning using causal graphs to capture relationships between varied cloud system performance metrics. To be effective, however, systems developers must correctly define the causal graph of their system, which is a time-consuming, brittle, and challenging task that increases in difficulty for large and dynamic systems and requires domain expertise. Alternatively, automated data-driven approaches have limited efficacy for cloud systems due to the inherent rarity of incidents. In this work, we present Atlas, a novel approach to…
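To make the idea of causal-graph-based fault localization concrete, here is a minimal sketch. It is not Atlas's method and not taken from the paper: the metric names, the graph, and the helper functions are all hypothetical, and the rule used (an anomalous metric with no anomalous direct cause is a root-cause candidate) is a common simple heuristic chosen only for illustration.

```python
from typing import Dict, List, Set

# Hypothetical causal graph (an assumption for illustration): each metric
# maps to the metrics it directly influences.
CAUSAL_GRAPH: Dict[str, List[str]] = {
    "db_cpu_utilization": ["db_query_latency"],
    "db_query_latency": ["api_latency"],
    "cache_hit_rate": ["api_latency"],
    "api_latency": ["frontend_error_rate"],
    "frontend_error_rate": [],
}


def parents_of(graph: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert the cause -> effects mapping into effect -> direct causes."""
    parents: Dict[str, Set[str]] = {node: set() for node in graph}
    for cause, effects in graph.items():
        for effect in effects:
            parents.setdefault(effect, set()).add(cause)
    return parents


def root_cause_candidates(
    graph: Dict[str, List[str]], anomalous: Set[str]
) -> List[str]:
    """Return anomalous metrics none of whose direct causes are anomalous."""
    parents = parents_of(graph)
    return sorted(
        metric
        for metric in anomalous
        if not (parents.get(metric, set()) & anomalous)
    )


if __name__ == "__main__":
    observed = {
        "db_cpu_utilization",
        "db_query_latency",
        "api_latency",
        "frontend_error_rate",
    }
    print(root_cause_candidates(CAUSAL_GRAPH, observed))
```

In this toy example the anomaly propagates along a single chain, so only the metric at the head of the chain is reported. The quality of the result depends entirely on the correctness of the graph, which is why the abstract stresses how costly it is to define that graph by hand.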
Taxonomy
Topics: Software System Performance and Reliability · Data Quality and Management · Service-Oriented Architecture and Web Services
