SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

Yifeng Ding; Lingming Zhang

arXiv:2601.22129·cs.SE·February 6, 2026

Open Access

TL;DR

SWE-Replay is a novel, efficient test-time scaling method for software engineering agents that recycles prior trajectories to reduce costs and improve performance without relying on noisy value estimates.

Contribution

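The paper introduces SWE-Replay, which it describes as the first generalizable and cost-effective test-time scaling technique for modern agents. SWE-Replay balances exploration and exploitation dynamically: each trial either samples a new trajectory from scratch or branches from a critical intermediate step of an archived trajectory.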

Findings

01. Reduces scaling costs by up to 17.4%

02. Maintains performance or improves it by up to 3.8%

03. Demonstrates effectiveness across multiple SWE benchmarks

Abstract

Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps…
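The explore-or-exploit loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract is truncated before it explains how intermediate steps are selected, so the sketch leaves that choice to a caller-supplied callback, `pick_branch_step`, which is hypothetical. The fixed `explore_prob`, the `run_agent` callback, and the `Trajectory` type are likewise assumptions made for illustration, and a real agent would also need to restore the environment state at the branch point.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    steps: List[str]   # actions taken by the agent, in order
    reused: int = 0    # number of leading steps copied from an archived trajectory


def replay_scaling(
    run_agent: Callable[[List[str]], List[str]],    # hypothetical: continues from a prefix, returns the full step list
    pick_branch_step: Callable[[Trajectory], int],  # hypothetical: index at which to branch
    num_trials: int,
    explore_prob: float,                            # assumption: fixed chance of starting from scratch
    rng: random.Random,
) -> List[Trajectory]:
    """Run num_trials trials, each either from scratch or from an archived prefix."""
    archive: List[Trajectory] = []
    for _ in range(num_trials):
        if not archive or rng.random() < explore_prob:
            prefix: List[str] = []                  # explore: sample a fresh trajectory
        else:
            source = rng.choice(archive)            # exploit: reuse archived experience
            cut = pick_branch_step(source)
            prefix = source.steps[:cut]
        steps = run_agent(list(prefix))
        archive.append(Trajectory(steps=steps, reused=len(prefix)))
    return archive


def toy_agent(prefix: List[str]) -> List[str]:
    """Toy stand-in for an agent: extends any prefix to exactly five steps."""
    return prefix + [f"step{i}" for i in range(len(prefix), 5)]


def midpoint(traj: Trajectory) -> int:
    """Placeholder branch rule (not the paper's): cut at the middle step."""
    return len(traj.steps) // 2


archive = replay_scaling(toy_agent, midpoint, num_trials=4,
                         explore_prob=0.0, rng=random.Random(0))
newly_generated = sum(len(t.steps) - t.reused for t in archive)
```

In this toy setup, every reused prefix step is a step the agent does not have to generate again, which is the source of the savings the abstract attributes to recycling trajectories. The sketch models only the control flow, so it says nothing about the cost or accuracy figures reported for SWE-Replay.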

Taxonomy

Topics: Software Engineering Research · Scientific Computing and Data Management · Mobile Crowdsensing and Crowdsourcing