TL;DR
This paper introduces R4P, a reasoning-based patch verifier that provides scalable rewards for training software engineering agents, significantly improving verification speed and accuracy over traditional test-based methods.
Contribution
The paper presents R4P, a novel reasoning-based patch verifier that enables scalable, efficient, and accurate supervision of software agents without relying on heavy test sandboxing.
Findings
R4P achieves 72.2% accuracy in patch verification.
Mini-SE with R4P improves Pass@1 to 32.8%.
R4P verifies patches 50x faster than testing.
Abstract
While large language model agents have advanced software engineering tasks, the unscalable nature of existing test-based supervision is limiting the potential improvement of data scaling. The reason is twofold: (1) building and running a test sandbox is heavy and fragile, and (2) data with high-coverage tests is naturally rare and threatened by test hacking via edge cases. In this paper, we propose R4P, a patch verifier model to provide scalable rewards for training and testing SWE agents via reasoning. We consider that patch verification is fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain sufficient reference and reduce the risk of reward hacking, R4P uses a group-wise objective for RL training, enabling it to verify multiple patches against each other's modifications and gain a…
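The abstract is cut off before it defines the group-wise objective, so the following is only an illustrative sketch and not the paper's formulation. It assumes, hypothetically, that the verifier emits one correct/incorrect verdict per candidate patch in a group and that the reward is the fraction of verdicts matching a reference label. The names `PatchVerdict` and `groupwise_reward` are invented for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PatchVerdict:
    """One verifier judgment for one candidate patch (hypothetical structure)."""
    patch_id: str
    predicted_correct: bool  # verdict the verifier reached by reasoning
    label_correct: bool      # reference label available at training time


def groupwise_reward(verdicts: Sequence[PatchVerdict]) -> float:
    """Return the fraction of verdicts in the group that match their labels.

    Assumption: scoring a whole group at once yields a value in [0, 1]
    instead of a single 0/1 outcome per patch.
    """
    if not verdicts:
        raise ValueError("a group must contain at least one patch")
    hits = sum(v.predicted_correct == v.label_correct for v in verdicts)
    return hits / len(verdicts)


# Example group of four candidate patches for the same issue.
group = [
    PatchVerdict("patch-a", predicted_correct=True, label_correct=True),
    PatchVerdict("patch-b", predicted_correct=False, label_correct=False),
    PatchVerdict("patch-c", predicted_correct=True, label_correct=False),
    PatchVerdict("patch-d", predicted_correct=False, label_correct=False),
]
reward = groupwise_reward(group)
```

Under this assumed scoring, three of the four verdicts match their labels, so the group receives partial credit rather than an all-or-nothing signal.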
Peer Reviews
Decision·Submitted to ICLR 2026
* The analysis in 5.3 is really comprehensive
* Experiments prove the efficacy of R4P on a simple agent scaffold, and strong performance as a verifier when compared with existing non-reasoning verifiers
* Figure 3 shows rewards for test data. Test data should be used in experiments very sparingly, to avoid compromising integrity of conclusions about model generalization
* Evaluation is only performed on the mini-SE scaffold, rather than other scaffolds that achieve stronger base performance on SWE-bench-verified. Would R4P still be useful in these other scaffolds? Do these scaffolds make R4P more difficult because they include much longer trajectories, which are more difficult to reason about (
1. Innovative Test-Free Supervision Paradigm. R4P redefines software agent supervision as a reasoning task, eliminating the dependency on sandbox testing. This shift addresses scalability, cost, and fragility in existing solutions.
2. The reward model design is novel.
1. Scope of this work can be better elaborated.
2. The evaluation can be more comprehensive.
My major concern with this work is clarity and evaluation; I believe these shortcomings can be overcome before submitting the camera-ready version. Why is the reward design technically sound, and how does it affect the learning? I think this is important when designing a reward function for reinforcement learning, and maybe it is better to elaborate that it is aligned with your objective to avoid reward hackin
- The group-wise reasoning objective transforms sparse binary verification into a dense, stable reward signal
- The paper provides ample evidence showing the advantage of R4P, along with nice ablation studies to analyze the behavior of R4P model
- The reward model is fixed post-training, leading to potential reward drift as agents improve. It would be interesting to understand the RL behavior when you overtrain the model with such a static reward model
- In Fig. 9, it would be good to draw the confidence interval to see if the trend is significant. The bins to the right have too few samples, which makes the conclusion that "verification accuracy positively correlates with/ number of edited lines" a bit ungrounded
- Despite the two
