Beyond Isolated Tasks: A Framework for Evaluating Coding Agents on Sequential Software Evolution
KN Ajay Shastry, Ganesh Senrayan, Shrey Satapara, Pranoy Panda, Chaitanya Devaguptapu

TL;DR
This paper introduces a framework and dataset for evaluating coding agents on long-horizon, sequential software development tasks, and shows where existing isolated, single-PR evaluations fall short.
Contribution
It presents SWE-STEPS, a dataset and framework for assessing coding agents on long-horizon, dependent PR chains, reflecting real-world developer workflows.
Findings
Isolated PR evaluations overestimate success rates by up to 20%.
Agents tend to increase technical debt and complexity despite resolving issues.
Sequential evaluation reveals spillover effects ignored by existing benchmarks.
Abstract
Existing datasets for coding agents evaluate performance on isolated, single pull request (PR) tasks in a stateless manner, failing to capture the reality of software development, where code changes accumulate, technical debt accrues, and test suites grow over time. To bridge this gap, we introduce an automated coding-task generation framework, which we use to generate our dataset, SWE-STEPS. The dataset evaluates coding agents on long-horizon tasks in two realistic settings that mirror actual developer workflows: conversational coding with iterative requests, and single-shot Project Requirement Document (PRD)-based coding. Unlike existing datasets that evaluate agents on disjoint PRs, our framework assesses performance across chains of dependent PRs, enabling evaluation of sequential execution, regression verification, and long-term repository health. We discover…
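To make the contrast between stateless and sequential evaluation concrete, here is a minimal toy sketch in Python. It is not the SWE-STEPS framework or its API: the `Task` class, the dictionary-based "repository", and both evaluation functions are hypothetical names invented for illustration. The only idea taken from the abstract is that a sequential run carries the agent's own repository state forward and re-runs earlier tests, while an isolated run resets to a reference state before every PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy stand-in for a repository: symbol name -> value (hypothetical model).
Repo = Dict[str, int]


@dataclass
class Task:
    name: str
    reference_start: Repo                 # ground-truth repo state before this PR
    agent_patch: Callable[[Repo], Repo]   # stand-in for the agent's change
    test: Callable[[Repo], bool]          # test introduced by this PR


def evaluate_isolated(chain: List[Task]) -> List[bool]:
    """Stateless: every task starts from its reference state and runs only its own test."""
    results = []
    for task in chain:
        repo = task.agent_patch(dict(task.reference_start))
        results.append(task.test(repo))
    return results


def evaluate_sequential(chain: List[Task], initial: Repo) -> List[bool]:
    """Stateful: the agent's output feeds the next task, and all earlier tests are re-run."""
    repo = dict(initial)
    accumulated_tests: List[Callable[[Repo], bool]] = []
    results = []
    for task in chain:
        repo = task.agent_patch(repo)
        accumulated_tests.append(task.test)
        results.append(all(t(repo) for t in accumulated_tests))
    return results


def patch_add_parse(repo: Repo) -> Repo:
    repo = dict(repo)
    repo["parse"] = 1
    repo["core"] = 0  # unintended side effect that this task's own test does not check
    return repo


def patch_add_render(repo: Repo) -> Repo:
    repo = dict(repo)
    repo["render"] = repo["core"] + repo["parse"]  # depends on earlier state
    return repo


chain = [
    Task("add-parse", {"core": 1}, patch_add_parse, lambda r: r.get("parse") == 1),
    Task("add-render", {"core": 1, "parse": 1}, patch_add_render, lambda r: r.get("render") == 2),
]

isolated = evaluate_isolated(chain)
sequential = evaluate_sequential(chain, {"core": 1})
print("isolated:  ", isolated)
print("sequential:", sequential)
```

In this toy chain the first patch silently damages `core`. The isolated run never sees that damage because the second task restarts from the reference state, so both tasks pass; the sequential run inherits the damaged state and the second task fails. This is one way a spillover effect can inflate isolated success rates.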
