SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
Hao Guan, Lingyue Fu, Shao Zhang, Yaoming Zhu, Kangning Zhang, Lin Qiu, Xunliang Cai, Xuezhi Cao, Weiwen Liu, Weinan Zhang, Yong Yu

TL;DR
SWE-Cycle is a benchmark and evaluation framework for autonomous code agents. It measures performance on three isolated tasks and on an integrated end-to-end FullCycle task, exposing a significant performance drop in the end-to-end setting.
Contribution
This work presents SWE-Cycle, a benchmark of 489 instances, together with SWE-Judge, an execution-capable evaluation agent, for measuring autonomous code agents across the complete issue resolution cycle.
Findings
Agent performance drops significantly when moving from isolated tasks to FullCycle execution.
Current state-of-the-art LLM-based agents struggle with cross-phase dependencies.
SWE-Judge verifies code correctness effectively and reduces measurement errors.
Abstract
As autonomous code agents move toward end-to-end software development, evaluating their practical autonomy becomes critical. Current benchmarks hide friction by testing agents in pre-configured environments, and their static evaluation pipelines frequently fail when parsing fully autonomous trajectories. We address these limitations with SWE-Cycle, a benchmark of 489 rigorously filtered instances. SWE-Cycle evaluates agents on three isolated tasks (environment reconstruction, code implementation, and verification test generation) and on an end-to-end FullCycle task that integrates all three. The FullCycle task requires agents to work autonomously in a bare repository without human scaffolding. To reliably assess these complex execution paths, we developed SWE-Judge. By combining static code review with dynamic testing, this execution-capable evaluation agent accurately…
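The task layout described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Task` enum, the `Verdict` fields, and the rule that a pass requires both the static review and the dynamic tests to succeed are all hypothetical, since the abstract does not say how SWE-Judge combines the two signals.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    # Task names follow the abstract; the enum itself is illustrative.
    ENV_RECONSTRUCTION = "environment reconstruction"
    CODE_IMPLEMENTATION = "code implementation"
    TEST_GENERATION = "verification test generation"
    FULL_CYCLE = "FullCycle"


@dataclass(frozen=True)
class Verdict:
    static_review_ok: bool  # hypothetical: outcome of a static code review
    dynamic_tests_ok: bool  # hypothetical: outcome of executing tests

    @property
    def passed(self) -> bool:
        # Assumption: an instance passes only if both signals succeed.
        return self.static_review_ok and self.dynamic_tests_ok


def pass_rate(verdicts: list[Verdict]) -> float:
    """Fraction of instances judged as passed (0.0 for an empty list)."""
    if not verdicts:
        return 0.0
    return sum(v.passed for v in verdicts) / len(verdicts)
```

Computing `pass_rate` separately for each isolated task and for `Task.FULL_CYCLE` is one way to compare the isolated and end-to-end settings that the findings contrast.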
