Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Guangtao Zeng; Maohao Shen; Delin Chen; Zhenting Qi; Subhro Das; Dan Gutfreund; David Cox; Gregory Wornell; Wei Lu; Zhang-Wei Hong; Chuang Gan

arXiv:2505.23604 · cs.CL · May 30, 2025

TL;DR

EvoScale is a sample-efficient evolutionary method that improves smaller language models' performance on software engineering tasks by self-evolving outputs through reinforcement learning, reducing the need for extensive sampling.

Contribution

The paper introduces EvoScale, an evolutionary test-time scaling method trained with reinforcement learning that lets smaller language models refine their own outputs and match larger models' performance on software engineering tasks.

Findings

01. A 32B model matches or exceeds models with over 100B parameters on SWE-Bench.

02. EvoScale significantly reduces the number of samples needed at test time.

03. Self-evolving models achieve high accuracy with fewer resources.

Abstract

Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when they have fewer than 100B parameters. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining…
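To make the contrast in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. `best_of_n` illustrates the test-time scaling baseline the abstract describes (sample many outputs, score each with a verifier, keep the best). `iterative_refine` is a generic select-and-refine loop that only illustrates the idea of treating generation as an evolutionary process. All names, parameters, and the toy numeric "candidates" are hypothetical stand-ins for model outputs and a verifier.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(generate: Callable[[], T], score: Callable[[T], float], n: int) -> T:
    """Baseline: draw n independent candidates and return the highest-scoring one."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)


def iterative_refine(
    generate: Callable[[], T],
    refine: Callable[[T], T],
    score: Callable[[T], float],
    population: int,
    keep: int,
    rounds: int,
    rng: random.Random,
) -> T:
    """Generic evolutionary loop (an illustrative assumption, not EvoScale itself).

    Each round keeps the top `keep` candidates and fills the rest of the pool
    with refined copies of randomly chosen survivors.
    """
    pool = [generate() for _ in range(population)]
    for _ in range(rounds):
        survivors = sorted(pool, key=score, reverse=True)[:keep]
        children = [refine(rng.choice(survivors)) for _ in range(population - keep)]
        pool = survivors + children
    return max(pool, key=score)


if __name__ == "__main__":
    # Toy stand-ins: a "candidate" is a float, and the "verifier" rewards
    # closeness to a hidden target value.
    rng = random.Random(0)
    target = 10.0

    def toy_generate() -> float:
        return rng.uniform(0.0, 100.0)

    def toy_refine(x: float) -> float:
        return x + rng.gauss(0.0, 1.0)

    def toy_score(x: float) -> float:
        return -abs(x - target)

    print(best_of_n(toy_generate, toy_score, n=8))
    print(iterative_refine(toy_generate, toy_refine, toy_score,
                           population=8, keep=2, rounds=5, rng=rng))
```

In this sketch the baseline spends its whole budget on independent samples, while the loop reuses the best candidates found so far as starting points for the next round. Because survivors are carried over unchanged, the best score in the pool never decreases from one round to the next.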

Code & Models

Repositories

satori-reasoning/satori-swe (official)

Models

Satori-reasoning/Satori-SWE-RL-32B (Hugging Face model; 4 downloads, 3 likes)

Taxonomy

Topics: Software Engineering Research · Machine Learning and Data Classification · Software Testing and Debugging Techniques