TrioXpert: An Automated Incident Management Framework for Microservice System
Yongqian Sun, Yu Luo, Xidao Wen, Yuan Yuan, Xiaohui Nie, Shenglin Zhang, Tong Liu, Xi Luo

TL;DR
TrioXpert is a framework that combines multimodal data with large language models to handle incident management tasks in microservice systems, with the aim of improving both accuracy and interpretability.
Contribution
TrioXpert introduces a multimodal, end-to-end incident management framework with an LLM-based collaborative reasoning mechanism that serves multiple tasks.
Findings
Significant performance improvements in anomaly detection, failure triage, and root cause localization.
Effective deployment in a real-world production environment.
Enhanced interpretability with clear reasoning evidence.
Abstract
Automated incident management plays a pivotal role in large-scale microservice systems. However, many existing methods rely solely on single-modal data (e.g., metrics, logs, or traces) and struggle to simultaneously address multiple downstream tasks, including anomaly detection (AD), failure triage (FT), and root cause localization (RCL). Moreover, the lack of clear reasoning evidence in current techniques often leads to insufficient interpretability. To address these limitations, we propose TrioXpert, an end-to-end incident management framework capable of fully leveraging multimodal data. TrioXpert designs three independent data processing pipelines based on the inherent characteristics of different modalities, comprehensively characterizing the operational status of microservice systems from both numerical and textual dimensions. It employs a collaborative reasoning mechanism using…
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Software-Defined Networks and 5G
