STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices
Junle Wang, Xingchuang Liao, Wenjun Wu

TL;DR
STAR is a framework that improves the reliability of LLM-based root cause analysis agents in microservices by decomposing workflows into stages, enabling targeted debugging and repair of errors.
Contribution
The paper introduces STAR, a stage-attributed triage and repair framework that localizes and fixes errors in RCA workflows, enhancing fault localization and repair efficiency.
Findings
STAR improves root cause localization accuracy.
STAR effectively repairs incorrect RCA traces within one or two rounds.
Stage-specific evaluation and repair significantly enhance RCA reliability.
Abstract
LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing,…
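To make the idea of a stage-localizable failure concrete, here is a minimal sketch in plain Python. It is not the authors' implementation and does not use LangGraph. The stage names EP, HS, AS, and DR come from the abstract; everything else (the RCATrace container, the four audit checks, the field names, and the choice to re-run the failing stage plus everything downstream) is a hypothetical illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Stage order taken from the abstract: EP -> HS -> AS -> DR.
STAGES: List[str] = ["EP", "HS", "AS", "DR"]


@dataclass
class RCATrace:
    """Hypothetical container: one output dict per stage."""
    outputs: Dict[str, dict] = field(default_factory=dict)


def audit_ep(trace: RCATrace) -> bool:
    # Assumed check: at least one piece of evidence was collected.
    return bool(trace.outputs.get("EP", {}).get("evidence"))


def audit_hs(trace: RCATrace) -> bool:
    # Assumed check: every hypothesis cites evidence that exists in EP.
    evidence = set(trace.outputs.get("EP", {}).get("evidence", []))
    hypotheses = trace.outputs.get("HS", {}).get("hypotheses", [])
    return bool(hypotheses) and all(
        bool(h["supported_by"]) and set(h["supported_by"]) <= evidence
        for h in hypotheses
    )


def audit_as(trace: RCATrace) -> bool:
    # Assumed check: the analysis ranks only hypotheses proposed in HS.
    names = {h["name"] for h in trace.outputs.get("HS", {}).get("hypotheses", [])}
    ranked = trace.outputs.get("AS", {}).get("ranked", [])
    return bool(ranked) and set(ranked) <= names


def audit_dr(trace: RCATrace) -> bool:
    # Assumed check: the reported root cause is the top-ranked hypothesis.
    ranked = trace.outputs.get("AS", {}).get("ranked", [])
    return bool(ranked) and trace.outputs.get("DR", {}).get("root_cause") == ranked[0]


AUDITORS: Dict[str, Callable[[RCATrace], bool]] = {
    "EP": audit_ep, "HS": audit_hs, "AS": audit_as, "DR": audit_dr,
}


def first_failing_stage(trace: RCATrace) -> Optional[str]:
    """Return the earliest stage whose audit fails, or None if all pass."""
    for stage in STAGES:
        if not AUDITORS[stage](trace):
            return stage
    return None


def stages_to_rerun(failing: str) -> List[str]:
    """Assumption: an early error taints every later stage, so re-run from it."""
    return STAGES[STAGES.index(failing):]


# A trace whose hypothesis cites evidence that was never collected.
trace = RCATrace(outputs={
    "EP": {"evidence": ["cpu_spike@cart", "timeout@checkout"]},
    "HS": {"hypotheses": [{"name": "db_lock", "supported_by": ["slow_query@db"]}]},
    "AS": {"ranked": ["db_lock"]},
    "DR": {"root_cause": "db_lock"},
})
failing = first_failing_stage(trace)
print(failing, stages_to_rerun(failing))
```

In this toy trace the final report looks self-consistent, yet the audit attributes the fault to HS, because the hypothesis rests on evidence absent from EP. An end-to-end check on the final answer alone would not say where the reasoning went wrong.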
