AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis
Pei Yang, Wanyi Chen, Asuka Yuxi Zheng, Xueqian Li, Xiang Li, Haoqin Tu, Jie Xiao, Yifan Pang, Dongdong Zhang, Fuqiang Li, Alfred Long, Lynn Ai, Eric Yang, Bill Shi

TL;DR
AOI introduces a multi-agent framework that transforms failed operational trajectories into training signals, enhancing autonomous cloud diagnosis while addressing data privacy and safety constraints.
Contribution
This work presents AOI, a novel framework combining trajectory learning, privacy-preserving knowledge distillation, and failure-based data augmentation for improved SRE automation.
Findings
AOI achieves a 66.3% success rate on benchmark tasks, surpassing previous methods.
Locally trained models reach 42.9% accuracy on unseen fault types, outperforming larger models.
Converting failed trajectories into training signals improves diagnostic accuracy and reduces variance.
Abstract
Large language model (LLM) agents offer a promising data-driven approach to automating Site Reliability Engineering (SRE), yet their enterprise deployment is constrained by three challenges: restricted access to proprietary data, unsafe action execution under permission-governed environments, and the inability of closed systems to improve from failures. We present AOI (Autonomous Operations Intelligence), a trainable multi-agent framework formulating automated operations as a structured trajectory learning problem under security constraints. Our approach integrates three key components. First, a trainable diagnostic system applies Group Relative Policy Optimization (GRPO) to distill expert-level knowledge into locally deployed open-source models, enabling preference-based learning without exposing sensitive data. Second, a read-write separated execution architecture decomposes…
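The abstract names Group Relative Policy Optimization (GRPO) as the training method. As a rough illustration of the idea behind GRPO in general, not of AOI's actual implementation, the sketch below computes group-relative advantages: several trajectories are sampled for the same task, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the reward values, and the use of the population standard deviation are assumptions made for this example; the excerpt does not describe AOI's reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper for illustration only. `rewards` holds one scalar
    reward per trajectory sampled for the same task. `eps` avoids division
    by zero when every trajectory in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four sampled diagnosis trajectories (1.0 = success).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every trajectory failed yields no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Successful trajectories receive positive advantages and failed ones negative advantages, so the policy update favors behavior that beats the group average. A group in which every trajectory fails produces all-zero advantages, which is one general reason failed trajectories are hard to learn from directly.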
Taxonomy
Topics: Software System Performance and Reliability · Adversarial Robustness in Machine Learning · Big Data and Digital Economy
