OpsAgent: An Evolving Multi-agent System for Incident Management in Microservices
Yu Luo, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Xidao Wen, Yongqian Sun, Shenglin Zhang, Dan Pei

TL;DR
OpsAgent is a self-evolving multi-agent system designed for incident management in microservices, offering interpretability, cost-efficiency, and adaptability validated through extensive experiments and real-world deployment.
Contribution
It introduces a training-free data processor and a dual self-evolution mechanism, enabling generalizable, interpretable, and sustainable incident management in microservice systems.
Findings
Achieves state-of-the-art performance on the OPENRCA benchmark.
Demonstrates generalizability and interpretability in diverse microservice environments.
Successfully deployed in Lenovo's production environment, validating industrial applicability.
Abstract
Incident management (IM) is central to the reliability of large-scale microservice systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces, is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model…
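To make the idea of a training-free data processor concrete, here is a minimal sketch of turning raw metrics and logs into structured text without any learned model. It is not the paper's implementation: the names `MetricSeries`, `describe_metric`, and `describe_logs`, the k-sigma rule, and the keyword list are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Sequence


@dataclass
class MetricSeries:
    """Hypothetical container for one metric of one service."""
    service: str
    name: str
    values: List[float]  # chronological samples


def describe_metric(series: MetricSeries, baseline_len: int, k: float = 3.0) -> str:
    """Summarize a metric as text using a simple k-sigma rule (an assumption)."""
    baseline = series.values[:baseline_len]
    recent = series.values[baseline_len:]
    if not baseline or not recent:
        return f"{series.service}/{series.name}: insufficient data"
    mu = mean(baseline)
    sigma = pstdev(baseline)
    # Pick the recent sample that deviates most from the baseline mean.
    peak = max(recent, key=lambda v: abs(v - mu))
    if abs(peak - mu) > k * sigma:
        direction = "above" if peak > mu else "below"
        return (f"{series.service}/{series.name}: ANOMALY, peak {peak:.2f} is "
                f"{direction} baseline mean {mu:.2f}")
    return f"{series.service}/{series.name}: normal, baseline mean {mu:.2f}"


def describe_logs(service: str, lines: Sequence[str],
                  keywords: Sequence[str] = ("ERROR", "Exception", "timeout")) -> str:
    """Summarize logs as text by counting lines that contain any keyword."""
    hits = [ln for ln in lines if any(kw in ln for kw in keywords)]
    return f"{service}/logs: {len(hits)} of {len(lines)} lines matched error keywords"


latency = MetricSeries("cart", "latency_ms", [100, 102, 98, 101, 99, 100, 450])
print(describe_metric(latency, baseline_len=6))
print(describe_logs("cart", ["INFO ok", "ERROR db timeout", "ERROR db timeout"]))
```

Text summaries of this kind could then be passed to language-model agents as plain prompts, which is one plausible reason a textual intermediate form helps interpretability. The statistical rule here is only a stand-in for whatever the authors actually use.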
