GALA: Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis?
Yifang Tian, Yaming Liu, Zichun Chong, Zihang Huang, Hans-Arno Jacobsen

TL;DR
GALA is a multi-modal framework that enhances root cause analysis in microservice systems by combining causal inference with large language model reasoning, significantly improving diagnostic accuracy and actionability.
Contribution
GALA introduces a novel integration of causal inference and LLMs for multi-modal RCA, providing both accurate root cause detection and actionable remediation guidance.
Findings
Improves accuracy over state-of-the-art methods by up to 42.22% on an open-source benchmark.
Generates more causally sound and actionable diagnostic outputs.
Bridges automated diagnosis with practical incident resolution.
Abstract
Root cause analysis (RCA) in microservice systems is challenging, requiring on-call engineers to rapidly diagnose failures across heterogeneous telemetry such as metrics, logs, and traces. Traditional RCA methods often focus on single modalities or merely rank suspect services, falling short of providing actionable diagnostic insights with remediation guidance. This paper introduces GALA, a novel multi-modal framework that combines statistical causal inference with LLM-driven iterative reasoning for enhanced RCA. Evaluated on an open-source benchmark, GALA achieves substantial improvements over state-of-the-art methods of up to 42.22% accuracy. Our novel human-guided LLM evaluation score shows GALA generates significantly more causally sound and actionable diagnostic outputs than existing methods. Through comprehensive experiments and a case study, we show that GALA bridges the gap…
Taxonomy
Topics: Topic Modeling
