SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Minh V. T. Thai; Tue Le; Dung Nguyen Manh; Huy Phan Nhat; Nghi D. Q. Bui

arXiv:2512.18470 · cs.SE · April 7, 2026

TL;DR

SWE-EVO is a benchmark that evaluates AI coding agents on long-horizon software evolution: multi-step, multi-file modifications derived from the release notes of real-world open-source Python projects.

Contribution

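The paper introduces SWE-EVO, a benchmark of 48 tasks constructed from the release notes of seven mature open-source Python projects. Each task requires multi-step modifications spanning an average of 21 files and is validated against a test suite averaging 874 tests, which highlights the challenges current AI agents face in long-horizon software evolution.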

Findings

01

GPT-5.4 with OpenHands achieves only 25% on SWE-EVO, versus 72.80% for GPT-5.2 on SWE-Bench Verified.

02

Current agents struggle with multi-file, multi-step reasoning.

03

SWE-EVO exposes a large capability gap between isolated, single-issue tasks and long-horizon software evolution.

Abstract

Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO versus 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file…

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Repositories

bdqnghi/SWE-EVO (GitHub)
