SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

Lilin Wang; Lucas Ramalho; Alan Celestino; Phuc Anthony Pham; Yu Liu; Umang Kumar Sinha; Andres Portillo; Onassis Osunwa; Gabriel Maduekwe

arXiv:2512.17419 · cs.SE · December 22, 2025

Open Access

TL;DR

SWE-Bench++ is an automated, scalable framework that generates multilingual, repository-level software engineering benchmarks from open-source GitHub projects, enabling better evaluation and training of large language models.

Contribution

SWE-Bench++ introduces an automated pipeline that turns live pull requests into repository-level coding tasks across 11 languages, in contrast to prior benchmarks that rely on manual curation and static datasets.

Findings

01 Initial benchmark has 11,133 instances from 3,971 repositories.

02 State-of-the-art models achieve pass@10 scores around 36%.

03 Fine-tuning on SWE-Bench++ improves performance on existing benchmarks.
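
The pass@10 figure in the findings above is easier to interpret with the metric written out. This page does not say how the authors compute it, so the following is only a minimal sketch that assumes the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of sampled attempts per task and c is the number that pass the tests. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n sampled attempts, c of them correct.

    Assumption: this is the widely used unbiased estimator
    1 - C(n - c, k) / C(n, k); the source page does not confirm that
    SWE-Bench++ uses exactly this formula.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing attempts, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimates over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must contain at least one task")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this assumption, a benchmark-level pass@10 of about 36% would mean that, averaged over tasks, roughly 36% of tasks are expected to be solved by at least one of 10 sampled attempts.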

Abstract

Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on Python-based bug fixes. We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests across 11 languages. SWE-Bench++ turns GitHub pull requests (PRs) into reproducible, execution-based tasks via four stages: programmatic sourcing, environment synthesis, test oracle extraction, and quality assurance. A final hint-guided trajectory synthesis step converts instances that strong models fail on into training trajectories. Our initial benchmark consists of 11,133 instances from 3,971…

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Scientific Computing and Data Management