TL;DR
SWE-EVO is a new benchmark designed to evaluate AI coding agents on long-horizon software evolution tasks involving multi-file, multi-step modifications in real-world projects.
Contribution
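The paper introduces SWE-EVO, a benchmark built from the release notes of seven mature open-source Python projects. Its 48 tasks require multi-step modifications spanning an average of 21 files, and each is validated against a test suite averaging 874 tests per instance.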
Findings
GPT-5.4 with OpenHands achieves only 25% on SWE-EVO, versus 72.80% for GPT-5.2 on SWE-Bench Verified.
Current agents struggle with multi-file, multi-step reasoning.
The gap between results on single-issue benchmarks and on SWE-EVO shows that agent capabilities fall off sharply on long-horizon evolution tasks.
Abstract
Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO versus 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file…
