SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development

Yaxin Du; Yuzhu Cai; Yifan Zhou; Cheng Wang; Yu Qian; Xianghe Pang; Qian Liu; Yue Hu; Siheng Chen

arXiv:2505.16975 · cs.SE · February 9, 2026

TL;DR

This paper introduces SWE-Dev, a large-scale dataset designed to evaluate and train autonomous feature-driven software development systems using real-world tasks, environments, and executable tests, highlighting significant room for improvement in current models.

Contribution

SWE-Dev is the first comprehensive dataset for end-to-end feature-driven development, enabling supervised fine-tuning and reinforcement learning with verifiable, executable tasks.

Findings

01. Best single-turn model achieves only 22.51% Pass@1 on hard tasks.

02. Multi-agent systems improve performance to 56.44%.

03. Many tasks remain unsolved, indicating substantial room for advancement.
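The findings above are reported as Pass@1. As a minimal sketch of how such a score is commonly computed, the snippet below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), and averages it over tasks. The function names and the `results` layout are hypothetical and for illustration only; the paper's exact evaluation protocol is not described on this page.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of sampled solutions, c: number that pass all tests, k: budget.
    """
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average Pass@1 over tasks.

    `results` maps a (hypothetical) task id to per-sample pass/fail flags.
    """
    scores = [pass_at_k(len(flags), sum(flags), 1) for flags in results.values()]
    return sum(scores) / len(scores)
```

With k = 1 the estimator reduces to c / n per task, so with one sample per task the benchmark score is simply the fraction of tasks whose generated code passes every test.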

Abstract

Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks. However, feature-driven development, a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world end-to-end feature-driven software development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides all instances with a runnable environment and its developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. We evaluated SWE-Dev across 17 base LLMs,…
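The abstract notes that executable unit tests can supply reward signals for RL. The sketch below shows two simple ways a set of test outcomes could be mapped to a scalar reward: an all-or-nothing reward and a fraction-passed reward. Both the `UnitTestResult` type and the reward definitions are assumptions made for illustration; the reward actually used in SWE-Dev is not specified on this page.

```python
from dataclasses import dataclass


@dataclass
class UnitTestResult:
    """Outcome of one executable unit test (hypothetical container)."""
    name: str
    passed: bool


def binary_reward(results: list[UnitTestResult]) -> float:
    """Return 1.0 only if there is at least one test and every test passed."""
    return 1.0 if results and all(r.passed for r in results) else 0.0


def fractional_reward(results: list[UnitTestResult]) -> float:
    """Return the share of tests that passed; 0.0 when there are no tests."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)
```

A binary reward matches a strict "all tests pass" criterion, while a fractional reward gives denser feedback on partially correct implementations; which one suits training better is a design choice.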

Code & Models

Repositories

dorothyduuu/swe-dev (Official)

Datasets

Dorothydu/SWE-Dev
dataset · 89 dl

Taxonomy

Topics: Software Engineering Research · Software Engineering Techniques and Practices · Topic Modeling

Methods: Sparse Evolutionary Training