DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle
Yuheng Tang, Kaijie Zhu, Bonan Ruan, Chuqi Zhang, Michael Yang, Hongwei Li, Suyue Guo, Tianneng Shi, Zekun Li, Christopher Kruegel, Giovanni Vigna, Dawn Song, William Yang Wang, Lun Wang, Yangruibo Ding, Zhenkai Liang, Wenbo Guo

TL;DR
DevOps-Gym is a comprehensive benchmark that evaluates AI agents across the entire software DevOps cycle, revealing current limitations and guiding future research in automating complex software workflows.
Contribution
This paper introduces DevOps-Gym, the first end-to-end benchmark for AI agents in DevOps, including real-world tasks and tool interfaces across multiple projects and languages.
Findings
State-of-the-art models struggle with issue resolving and test generation.
Models are unable to handle new tasks like monitoring and build configuration.
The benchmark reveals fundamental limitations of current AI agents in DevOps tasks.
Abstract
Although AI agents have demonstrated extraordinary capabilities in code generation and software issue resolving, their capabilities across the full software DevOps cycle remain unknown. Unlike pure code generation, handling the DevOps cycle in real-world software, which includes developing, deploying, and managing, requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions. However, existing benchmarks focus on isolated problems and lack environments and tool interfaces for DevOps. We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps workflows: build and configuration, monitoring, issue resolving, and test generation. DevOps-Gym includes 700+ real-world tasks collected from 30+ projects in Java and Go. We develop a semi-automated data collection mechanism…
Peer Reviews
Decision: ICLR 2026 Poster
The paper’s main strength lies in its clear motivation and solid contribution to an underexplored area. It identifies that while many existing benchmarks test AI on programming or debugging tasks, none assess performance across the complete DevOps pipeline, which involves building, monitoring, fixing, and testing software systems. By defining this broader scope, the authors push evaluation toward real-world DevOps scenarios that require understanding of environments, tool interaction, and sequen
The paper, while ambitious and valuable, has several weaknesses that limit its overall contribution. First, the scope is too broad relative to the experimental depth. Although it claims to evaluate the entire DevOps cycle, the number of tasks for some categories—such as monitoring or build/configuration—is relatively small compared to the hundreds of issue-resolving and test-generation cases. This uneven distribution makes the evaluation appear unbalanced, and it is unclear whether the benchmar
1. **Addresses Real Gap:** First benchmark attempting end-to-end DevOps evaluation
2. **Realistic Task Design:** Docker environments with real tools better simulate practice than synthetic environments, and the tasks cover diverse DevOps stages (though incompletely)
3. **Extensive Manual Effort:** Authors invested significant time in task construction and validation
4. **Negative Results Are Valuable:** Showing current agents fail badly on DevOps tasks is an important finding
5. **Practical Relevance:**
1. **No statistical rigor**: Missing confidence intervals, significance tests, and multiple runs; single-run results are unreliable given the tiny task counts.
2. **Small dataset**: Only 30 monitoring and 48 build tasks, far fewer than in benchmarks like SWE-bench, which limits generalizability.
3. **Limited analysis**: The paper reports binary accuracy for monitoring, but provides no partial credit or false-positive analysis.
4. **Limited baselines**: Excludes major frameworks (Devin, Ai
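To illustrate the reviewer's concern about small task counts, the following is a minimal sketch of a Wilson score interval for a binary pass rate. The function name `wilson_interval` and the 6-of-30 example are hypothetical and not taken from the paper; only the task counts (30 monitoring, 48 build) come from the review above.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical outcome: an agent solves 6 of the 30 monitoring tasks.
    low, high = wilson_interval(6, 30)
    print(f"pass rate 20%, 95% interval roughly {low:.1%} to {high:.1%}")
```

With only 30 tasks, the interval around a 20% pass rate spans several tens of percentage points, so small differences between models on such a category are hard to distinguish from noise in a single run.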
- The paper makes a comprehensive contribution, covering all phases of DevOps operations. The benchmark is based on real-world repositories and targets languages common in industry, like Java and Go (in contrast to benchmarks that target Python). The experimental setup is well-designed.
- The paper is very well written. The key aspects of the benchmark's design are explained properly, showing both the diversity of sub-tasks within each task category, and clearly specifying input, output and eva
- The benchmark curation involved synthetic data generation in which faults are injected by experts. The paper claims that these faults are inspired by real-world scenarios, but adequate details are not provided.
- The paper uses OpenHands and mini-SWE-Agent as harnesses. Though they may allow access to the terminal, their tools are primarily designed to assist agents in GitHub issue resolution. No effort is made to design an agentic harness that specifically aids the agent in DevOps tasks.
- It is
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Advanced Software Engineering Methodologies
