Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents
Jiawei Huang, Qingping Yang, Renjie Zheng, Jiaze Chen

TL;DR
This paper introduces a rubric-based Generative Reward Model (GRM) that gives richer feedback than binary pass/fail signals, and uses that feedback to improve behavior shaping and final accuracy when fine-tuning LLM agents for software engineering (SWE) tasks.
Contribution
The paper proposes a rubric-based GRM, equipped with human-designed rubrics, that guides Reinforced Fine-Tuning (RFT) of LLM agents on SWE tasks and surpasses methods that rely on terminal rewards alone.
Findings
Compared with terminal rewards alone, the rubric-based GRM suppresses undesirable behaviors more effectively.
It also promotes beneficial behaviors during training.
Final test accuracy is higher with the proposed method.
Abstract
Despite recent progress in Large Language Model (LLM) Agents for Software Engineering (SWE) tasks, end-to-end fine-tuning typically relies on verifiable terminal rewards such as whether all unit tests pass. While these binary signals reflect whether the final solution is correct, they provide little guidance for shaping intermediate behaviors during multi-step interactions, thereby limiting improvements in the overall quality of the resolution process. To address this, we introduce a rubric-based Generative Reward Model (GRM) that provides richer learning signals. The GRM is equipped with human-designed rubrics that indicate criteria for encouraging or discouraging specific behavioral patterns, and we leverage this feedback for high-quality training data collection via trajectory filtration. When used for Reinforced Fine-Tuning (RFT) on SWE Tasks, our approach outperforms…
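The trajectory filtration idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trajectory` fields, the rubric names, the per-rubric scores in [0, 1], and the 0.5 threshold are all hypothetical, chosen only to show how a rubric-based signal could be combined with a terminal pass/fail reward when selecting training data.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One agent rollout on a SWE task (all fields are illustrative)."""
    task_id: str
    tests_passed: bool  # verifiable terminal reward: did all unit tests pass?
    # Hypothetical GRM output: one score in [0, 1] per rubric item.
    rubric_scores: dict[str, float] = field(default_factory=dict)


def keep_trajectory(traj: Trajectory, min_rubric_score: float = 0.5) -> bool:
    """Keep a rollout only if it passes the tests AND every rubric item
    clears the (assumed) threshold. A rollout with no rubric scores is
    judged on the terminal reward alone."""
    if not traj.tests_passed:
        return False
    return all(s >= min_rubric_score for s in traj.rubric_scores.values())


def filter_trajectories(trajs: list[Trajectory],
                        min_rubric_score: float = 0.5) -> list[Trajectory]:
    """Return the subset of rollouts retained as fine-tuning data."""
    return [t for t in trajs if keep_trajectory(t, min_rubric_score)]


# Hypothetical rubric items, named for illustration only.
rollouts = [
    Trajectory("a", True, {"reproduces_bug_first": 0.9, "avoids_editing_tests": 0.8}),
    Trajectory("b", True, {"reproduces_bug_first": 0.9, "avoids_editing_tests": 0.2}),
    Trajectory("c", False, {"reproduces_bug_first": 1.0, "avoids_editing_tests": 1.0}),
]
kept_ids = [t.task_id for t in filter_trajectories(rollouts)]
print(kept_ids)  # ['a']
```

In this sketch, rollout "b" passes its tests but is dropped because one rubric score falls below the threshold, which is the kind of case a terminal reward alone cannot distinguish. How the actual GRM scores trajectories and how its feedback is aggregated are not specified in the text above.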
