SWE-Shepherd: Advancing PRMs for Reinforcing Code Agents
Mahir Labib Dihan, Md Ashrafur Rahman Khan

TL;DR
SWE-Shepherd introduces Process Reward Models to improve decision-making in code agents by providing dense, step-level feedback, enhancing efficiency and action quality in software engineering tasks.
Contribution
The paper presents a framework that trains a lightweight reward model on a base LLM to score individual agent actions, addressing the lack of fine-grained feedback in static prompting strategies and handcrafted heuristics.
Findings
PRMs improve interaction efficiency in code agents.
Action-level rewards help guide better decision-making.
Challenges remain in aligning intermediate rewards with final success.
Abstract
Automating real-world software engineering tasks remains challenging for large language model (LLM)-based agents because it requires long-horizon reasoning over large, evolving codebases and consistent decisions across interdependent actions. Existing approaches typically rely on static prompting strategies or handcrafted heuristics to select actions such as code editing, file navigation, and test execution, but they lack fine-grained feedback on intermediate decisions. This leads to inefficient exploration, error propagation, and brittle solution trajectories. To address this limitation, we propose SWE-Shepherd, a framework that introduces Process Reward Models (PRMs) to provide dense, step-level supervision for repository-level code agents. Using trajectories from SWE-Bench, we construct an action-level reward dataset and train a lightweight reward model on a base LLM to…
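The abstract is cut off before it says how the reward model is applied at inference time. One common way to use step-level rewards is to score several candidate next actions and prefer the highest-scoring one. The snippet below is a minimal sketch of that idea, not the authors' implementation: `rank_actions`, `Candidate`, and `toy_score_step` are hypothetical names, and the toy scorer is a hand-written stand-in for a trained PRM.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Candidate:
    action: str   # a proposed next action, e.g. "edit a.py"
    score: float  # step-level reward assigned by the scorer


def rank_actions(
    history: Sequence[str],
    candidates: Sequence[str],
    score_step: Callable[[Sequence[str], str], float],
) -> List[Candidate]:
    """Score each candidate action given the trajectory so far,
    then return the candidates sorted from best to worst."""
    scored = [Candidate(a, score_step(history, a)) for a in candidates]
    return sorted(scored, key=lambda c: c.score, reverse=True)


def toy_score_step(history: Sequence[str], action: str) -> float:
    """Hypothetical stand-in for a trained PRM (assumption, not from the paper).
    Penalizes repeating an earlier action and favors running tests after an edit."""
    if action in history:
        return 0.0
    if action.startswith("run_tests") and any(h.startswith("edit") for h in history):
        return 0.9
    return 0.5


if __name__ == "__main__":
    history = ["open_file a.py", "edit a.py"]
    candidates = ["open_file a.py", "run_tests", "search foo"]
    for c in rank_actions(history, candidates, toy_score_step):
        print(f"{c.score:.1f}  {c.action}")
```

In a real agent loop, `score_step` would wrap a call to the trained reward model, and the agent would execute the top-ranked action before sampling the next set of candidates.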
