STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds
Yinfang Chen, Jiaqi Pan, Jackson Clark, Yiming Su, Noah Zheutlin, Bhavya Bhavya, Rohan Arora, Yu Deng, Saurabh Jha, Tianyin Xu

TL;DR
STRATUS is an LLM-based multi-agent system for autonomous reliability engineering in cloud environments; it improves failure mitigation success rates over state-of-the-art SRE agents by at least 1.5x.
Contribution
The paper introduces STRATUS, a novel multi-agent system utilizing LLMs for autonomous Site Reliability Engineering in clouds, with formal safety guarantees and superior performance.
Findings
STRATUS outperforms state-of-the-art SRE agents by at least 1.5x in success rate.
Formalization of Transactional No-Regression (TNR) enhances safe failure mitigation.
STRATUS demonstrates practical potential for deployment in cloud reliability management.
Abstract
In cloud-scale systems, failures are the norm. A distributed computing cluster exhibits hundreds of machine failures and thousands of disk failures; software bugs and misconfigurations are reported to be more frequent. The demand for autonomous, AI-driven reliability engineering continues to grow, as existing human-in-the-loop practices can hardly keep up with the scale of modern clouds. This paper presents STRATUS, an LLM-based multi-agent system for realizing autonomous Site Reliability Engineering (SRE) of cloud services. STRATUS consists of multiple specialized agents (e.g., for failure detection, diagnosis, mitigation), organized in a state machine to assist system-level safety reasoning and enforcement. We formalize a key safety specification of agentic SRE systems like STRATUS, termed Transactional No-Regression (TNR), which enables safe exploration and iteration. We show that TNR…
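The abstract names TNR but the excerpt here does not give its formal definition. As a rough illustration only, the sketch below shows one plausible reading of the name: each mitigation attempt runs as a transaction on a copy of the system state and is committed only if a health metric does not get worse. The `State` type, the `severity` metric, and the example actions are all hypothetical and are not taken from the paper.

```python
import copy
from typing import Callable, Dict, List

# Hypothetical toy model: service name -> "ok" or "failing".
State = Dict[str, str]


def severity(state: State) -> int:
    """Hypothetical health metric: number of failing services (lower is better)."""
    return sum(1 for status in state.values() if status != "ok")


def mitigate(state: State, actions: List[Callable[[State], None]]) -> State:
    """Try each action on a copy; commit it only if severity does not increase."""
    for action in actions:
        trial = copy.deepcopy(state)   # work on a snapshot, never on committed state
        action(trial)
        if severity(trial) <= severity(state):
            state = trial              # commit: no regression observed
        # otherwise the trial is discarded, which acts as the rollback
        if severity(state) == 0:
            break                      # failure mitigated, stop exploring
    return state


# Hypothetical mitigation actions an agent might propose.
def bad_config_push(s: State) -> None:
    s["cache"] = "failing"
    s["web"] = "failing"


def restart_db(s: State) -> None:
    s["db"] = "ok"


initial = {"web": "ok", "db": "failing", "cache": "ok"}
final = mitigate(initial, [bad_config_push, restart_db])
print(final)
```

In this toy run the harmful action is discarded and the helpful one is committed, so a wrong guess by an agent never leaves the system worse off than before the attempt. A real system would need a far richer notion of state, health, and undo than a dictionary copy.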
Taxonomy
Topics: Software Reliability and Analysis Research · Cloud Data Security Solutions · Cloud Computing and Resource Management
