SWE-RM: Execution-free Feedback For Software Engineering Agents
KaShun Shum, Binyuan Hui, Jiawei Chen, Lei Zhang, X. W., Jiaxi Yang, Yuzhen Huang, Junyang Lin, Junxian He

TL;DR
This paper introduces SWE-RM, a large-scale, execution-free reward model designed to improve software engineering agents by providing fine-grained feedback for both test-time scaling and reinforcement learning, surpassing previous methods.
Contribution
The paper develops SWE-RM, a 30-billion-parameter mixture-of-experts reward model that enhances the training and performance of SWE agents across multiple evaluation metrics.
Findings
SWE-RM significantly improves test-time scaling (TTS) and reinforcement learning (RL) performance on SWE benchmarks.
It raises the accuracy of Qwen3-Coder-Flash on SWE-Bench Verified from 51.6% to 62.0%.
It achieves state-of-the-art results among open-source models.
Abstract
Execution-based feedback like unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse and cannot effectively distinguish between trajectories that are both successful or both unsuccessful. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. Aiming to develop versatile reward models that are effective across TTS and RL, however, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily…
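To make the TTS setting concrete, here is a minimal sketch of best-of-N selection with an execution-free scorer: each candidate trajectory receives a scalar score and the highest-scoring one is kept. The function name `best_of_n`, the `score_fn` callable, and the toy scores are hypothetical stand-ins for illustration; they are not SWE-RM's actual interface or outputs.

```python
from typing import Callable, Sequence


def best_of_n(trajectories: Sequence[str], score_fn: Callable[[str], float]) -> int:
    """Return the index of the trajectory with the highest score.

    `score_fn` is a hypothetical stand-in for a reward model that maps a
    trajectory to a scalar score without executing any tests.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")
    scores = [score_fn(t) for t in trajectories]
    # Ties resolve to the earliest candidate.
    return max(range(len(scores)), key=scores.__getitem__)


# Toy scores (made up for illustration) for three candidate patches.
toy_scores = {"patch_a": 0.31, "patch_b": 0.87, "patch_c": 0.55}
candidates = list(toy_scores)
chosen = best_of_n(candidates, toy_scores.__getitem__)
print(candidates[chosen])
```

Note that selection depends only on which candidate ranks first, so two scorers can agree on every top-1 choice while assigning very different score values to the remaining candidates.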
Peer Reviews
Decision: ICLR 2026 Poster
- The paper's primary strength is its core insight, which is clearly motivated and empirically demonstrated. The finding that test-time scaling (TTS) performance is an insufficient proxy for a reward model's utility in RL is interesting.
- The experiments are well executed: a series of ablations covers data scale, data composition, policy mixture, and context length. This gives the community good guidance and insight for identifying the source of the gains.
- The writing is exceptionally well-
- Opacity of the "Poor Calibrated RM" (Verifier B): The paper's entire motivation hinges on the comparison between "Verifier A (Good AUC & Cali.)" and "Verifier B (Bad AUC & Cali.)". We are shown that they have similar TTS but different RL outcomes. However, the paper never explains how Verifier B was trained or why it is poorly calibrated. Is it off-the-shelf (together with Verifier A) or trained by the authors (and if so, could the authors explain how to train it)? Is it the same as the "poorly calibrated RM" in
- Showing that identical TTS performance can yield divergent RL outcomes is an important empirical finding with implications for verifier evaluation and selection.
- The ablations on data scale, source composition, and context length provide concrete, actionable insights for building robust SWE reward models.
- The model supports very long (256k) contexts so the verifier can score large numbers of trajectories.
- A single verifier that supports both evaluation-time reranking and RL training is p
- The narrative sometimes shifts between TTS (used for evaluation/selection) and the execution-free RM (the actual trained model), which can make the pipeline slightly harder to follow. Being explicit about where TTS stops and supervised RM training begins would help.
- Because the paper starts from TTS limitations, it would help to spell out the actual supervised reward-model objective earlier so readers don’t assume TTS is the training signal.
- Evaluation is limited to SWE-Bench Verified, lea
The paper makes a reasonable observation that test-time scaling performance alone is insufficient for evaluating reward models intended for RL use. The empirical finding in Figure 2, which shows that two verifiers with similar TTS can have drastically different RL performance, is interesting and motivates the need for better evaluation criteria. The reported improvements are good: lifting Qwen3-Coder-Flash and Max on SWE-Bench Verified represents meaningful progress. Scaling the reward model to 256k
The paper does not adequately address a critical limitation: verifying program correctness can be as difficult as, or even harder than, executing the program. A reward model must essentially predict execution outcomes across potentially many different code paths and edge cases without actually running the code. This requires the model to simulate program semantics, which is fundamentally challenging and may introduce systematic errors. The paper does not discuss: How the reward model handles cor
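The reviews contrast verifiers by AUC and calibration. The sketch below computes both on toy data using standard textbook definitions (pairwise AUC and binned expected calibration error); it is an assumption that these match the paper's exact metrics, and every score and label here is invented for illustration. The two hypothetical verifiers pick the same top-1 trajectory, which is all that best-of-N selection sees, yet they differ in AUC and calibration.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a resolved (label 1) trajectory outscores an
    unresolved (label 0) one; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ece(scores: Sequence[float], labels: Sequence[int], n_bins: int = 10) -> float:
    """Expected calibration error with equal-width bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        bins[min(int(s * n_bins), n_bins - 1)].append((s, y))
    total = len(scores)
    err = 0.0
    for b in bins:
        if b:
            mean_score = sum(s for s, _ in b) / len(b)
            frac_resolved = sum(y for _, y in b) / len(b)
            err += len(b) / total * abs(mean_score - frac_resolved)
    return err


# Toy data (invented): four trajectories for one issue, 1 = resolved.
labels = [1, 0, 0, 1]
verifier_a = [0.95, 0.05, 0.15, 0.85]  # hypothetical well-behaved scorer
verifier_b = [0.95, 0.85, 0.15, 0.35]  # hypothetical poorly calibrated scorer

for name, scores in (("A", verifier_a), ("B", verifier_b)):
    top1 = max(range(len(scores)), key=scores.__getitem__)
    print(name, "top-1:", top1, "AUC:", round(auc(scores, labels), 3),
          "ECE:", round(ece(scores, labels), 3))
```

With only four samples the binned estimate is crude; the point is only that top-1 agreement does not constrain how the remaining scores are distributed.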
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software Engineering Techniques and Practices
