SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
Jackson Clark, Yiming Su, Saad Mohammad Rafid Pial, Yifang Tian, Lily Gniedziejko, Hans-Arno Jacobsen, Yinfang Chen, Tianyin Xu

TL;DR
SREGym is a live benchmark for AI SRE agents. It injects realistic failure scenarios into cloud-native system stacks and evaluates how well agents diagnose and mitigate them.
Contribution
The paper introduces SREGym, a modular, extensible framework with 90 realistic failure scenarios for evaluating AI SRE agents in production-like settings.
Findings
Frontier agents show up to 40% variation in failure mitigation performance.
SREGym effectively models diverse failure modes and environmental noise.
The benchmark is open-source and actively used by the research community.
Abstract
AI agents are increasingly used to diagnose and mitigate failures in production systems, a practice known as agentic Site Reliability Engineering (SRE). Current SRE benchmarks are limited to overly simplistic SRE tasks and are hard to extend due to bespoke designs. We present SREGym, a high-fidelity benchmark for SRE agents. SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. SREGym models the complexity of production environments by simulating (1) a wide range of faults at different layers, (2) various ambient noises, and (3) diverse failure modes such as metastable failures and correlated failures. SREGym is architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks. SREGym currently includes 90 realistic, challenging SRE…
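The abstract describes a framework that orchestrates fault and noise injectors. The sketch below is only an illustration of that general pattern, not SREGym's actual API: every name in it (`Injector`, `PodCrashFault`, `LogNoise`, `Scenario`, `Orchestrator`) is hypothetical, and the "environment" is assumed to be a plain in-memory dictionary rather than a live cluster.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Injector(Protocol):
    """Hypothetical interface: perturbs an environment and can undo it."""

    name: str

    def inject(self, env: dict) -> None: ...

    def recover(self, env: dict) -> None: ...


@dataclass
class PodCrashFault:
    """Illustrative fault: marks one service as crashed."""

    service: str
    name: str = "pod-crash"

    def inject(self, env: dict) -> None:
        env["services"][self.service] = "crashed"

    def recover(self, env: dict) -> None:
        env["services"][self.service] = "running"


@dataclass
class LogNoise:
    """Illustrative ambient noise: adds unrelated log volume."""

    lines: int
    name: str = "log-noise"

    def inject(self, env: dict) -> None:
        env["log_lines"] += self.lines

    def recover(self, env: dict) -> None:
        env["log_lines"] -= self.lines


@dataclass
class Scenario:
    """A failure scenario: one or more faults plus optional noise."""

    faults: list[Injector]
    noises: list[Injector] = field(default_factory=list)


class Orchestrator:
    """Applies a scenario's injectors and recovers them in reverse order."""

    def __init__(self, env: dict) -> None:
        self.env = env
        self._active: list[Injector] = []

    def start(self, scenario: Scenario) -> None:
        # Noise goes in first so the fault appears against a noisy background.
        for injector in [*scenario.noises, *scenario.faults]:
            injector.inject(self.env)
            self._active.append(injector)

    def stop(self) -> None:
        # Undo in reverse order of injection.
        while self._active:
            self._active.pop().recover(self.env)


if __name__ == "__main__":
    env = {"services": {"frontend": "running"}, "log_lines": 0}
    orchestrator = Orchestrator(env)
    orchestrator.start(
        Scenario(faults=[PodCrashFault("frontend")], noises=[LogNoise(100)])
    )
    print(env)  # environment while the scenario is active
    orchestrator.stop()
    print(env)  # environment after recovery
```

Keeping faults and noise behind the same small interface is one way to let new scenarios be composed without changing the orchestrator, which is the kind of extensibility the abstract emphasizes.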
