Automated Root Cause Analysis System for Complex Data Products
Mathieu Demarne, Miso Cilimdzic, Tom Falkowski, Timothy Johnson, Jim Gramling, Wei Kuang, Hoobie Hou, Amjad Aryan, Gayatri Subramaniam, Kenny Lee, Manuel Mejia, Lisa Liu, Divya Vermareddy

TL;DR
ARCAS is an automated root cause analysis platform that uses a specialized DSL and LLMs to quickly diagnose and mitigate issues in complex data products, reducing engineering effort and response time.
Contribution
The paper introduces ARCAS, a novel diagnostic system combining a domain-specific language and large language models for rapid, automated troubleshooting in data platforms.
Findings
Successfully applied across Azure Synapse Analytics and Microsoft Fabric Synapse Data Warehouse.
Reduces time-to-mitigate and engineering cycles compared to traditional monitoring tools.
Enables subject matter experts to create diagnostic guides without having to understand how those guides interact with the rest of the diagnostic platform.
Abstract
We present ARCAS (Automated Root Cause Analysis System), a diagnostic platform based on a Domain Specific Language (DSL) built for fast diagnostic implementation and a low learning curve. ARCAS is composed of a constellation of automated troubleshooting guides (Auto-TSGs) that can execute in parallel to detect issues using product telemetry and apply mitigation in near-real-time. The DSL is tailored specifically to ensure that subject matter experts can deliver highly curated and relevant Auto-TSGs in a short time without having to understand how they will interact with the rest of the diagnostic platform, thus reducing time-to-mitigate and saving crucial engineering cycles when they matter most. This contrasts with platforms like Datadog and New Relic, which primarily focus on monitoring and require manual intervention for mitigation. ARCAS uses a Large Language Model (LLM) to prioritize…
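The idea of independent guides that run in parallel against telemetry, each detecting one issue and applying its own mitigation, can be illustrated with a minimal Python sketch. This is not the ARCAS DSL or its API, neither of which appears in the abstract. Every name here (`AutoTSG`, `run_guides`, the telemetry keys, the thresholds, and the mitigation strings) is a hypothetical stand-in chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical telemetry snapshot: metric name -> latest value.
Telemetry = dict[str, float]


@dataclass(frozen=True)
class AutoTSG:
    """A hypothetical automated troubleshooting guide (not the ARCAS DSL)."""
    name: str
    detect: Callable[[Telemetry], bool]   # True when the issue is present
    mitigate: Callable[[Telemetry], str]  # returns a description of the action


def run_guides(guides: list[AutoTSG], telemetry: Telemetry) -> dict[str, Optional[str]]:
    """Run every guide in parallel; map guide name to its mitigation or None."""
    def run_one(guide: AutoTSG) -> Optional[str]:
        # Each guide is self-contained: it never inspects other guides.
        return guide.mitigate(telemetry) if guide.detect(telemetry) else None

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_one, guides))  # preserves input order
    return {guide.name: result for guide, result in zip(guides, results)}


# Assumed example guides and thresholds, for illustration only.
guides = [
    AutoTSG("high_cpu", lambda t: t["cpu_pct"] > 90, lambda t: "scale out compute"),
    AutoTSG("queue_backlog", lambda t: t["queued_queries"] > 100, lambda t: "throttle new queries"),
]
telemetry = {"cpu_pct": 97.0, "queued_queries": 12.0}
report = run_guides(guides, telemetry)
print(report)
```

Because each guide only reads the shared telemetry and returns its own result, a guide author in this sketch needs no knowledge of the other guides or of the runner, which mirrors the separation of concerns the abstract describes.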
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Risk and Safety Analysis
