RCA Copilot: Transforming Network Data into Actionable Insights via Large Language Models
Alexander Shan, Jasleen Kaur, Rahul Singh, Tarun Banka, Raj Yavatkar, T. Sridhar

TL;DR
RCACopilot combines statistical tests with large language model reasoning to automate root cause analysis in complex network environments, producing clear explanations and actionable insights that reduce manual effort for on-call engineers.
Contribution
This paper introduces RCACopilot, a novel system that integrates LLM reasoning with statistical tests to automate and explain network root cause analysis.
Findings
RCACopilot achieves high accuracy in identifying network root causes.
The system provides clear explanations and actionable steps for engineers.
It demonstrates effectiveness across diverse network environments.
Abstract
Ensuring the reliability and availability of complex networked services demands effective root cause analysis (RCA) across cloud environments, data centers, and on-premises networks. Traditional RCA methods, which involve manual inspection of data sources such as logs and telemetry data, are often time-consuming and challenging for on-call engineers. While statistical inference methods have been employed to estimate the causality of network events, these approaches alone are similarly challenging and suffer from a lack of interpretability, making it difficult for engineers to understand the predictions made by black-box models. In this paper, we present RCACopilot, an advanced on-call system that combines statistical tests and large language model (LLM) reasoning to automate RCA across various network environments. RCACopilot gathers and synthesizes critical runtime diagnostic…
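To make the idea of pairing statistical tests with LLM reasoning concrete, here is a minimal sketch. It is not RCACopilot's implementation: the choice of Welch's t-statistic, the function names (`welch_t`, `rank_metrics`, `build_prompt`), the metric names, and the prompt wording are all assumptions made for illustration. The sketch scores how much each metric shifted between a baseline window and an incident window, then formats the top-ranked metrics as text that could be handed to a language model.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def welch_t(baseline, incident):
    """Welch's t-statistic: incident mean minus baseline mean, in standard errors."""
    diff = mean(incident) - mean(baseline)
    se = sqrt(variance(baseline) / len(baseline) + variance(incident) / len(incident))
    if se == 0:
        # Both windows are constant: no shift, or an unbounded one.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se


def rank_metrics(baseline, incident):
    """Rank metrics by the absolute size of their shift (hypothetical helper)."""
    scores = {name: welch_t(baseline[name], incident[name]) for name in baseline}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


def build_prompt(ranked, top_k=2):
    """Turn the top-ranked metrics into a prompt (wording is illustrative only)."""
    lines = [f"- {name}: t = {t:.2f}" for name, t in ranked[:top_k]]
    return (
        "These metrics shifted most during the incident:\n"
        + "\n".join(lines)
        + "\nSuggest a likely root cause and next steps."
    )
```

Calling `rank_metrics` with two dictionaries that map metric names to lists of samples returns `(name, t)` pairs ordered from the largest shift to the smallest. The statistical step narrows the evidence, and the prompt asks the model to explain it in plain language, which is the division of labor the abstract describes.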
