MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation
Chanakya Ekbote, Vijay Lingam, Sujay Sanghavi, Jun Huan, Behrooz Omidvar-Tehrani, Anoop Deoras, Stefano Soatto

TL;DR
MURPHY is a multi-turn reinforcement learning method that improves code generation by iteratively refining solutions with environmental feedback, with reported gains of up to 6% absolute pass@1 over prior methods.
Contribution
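It introduces a multi-turn extension of GRPO that constructs feedback-conditioned rollout trees and propagates rewards backward through them, so that later successful refinements credit the earlier attempts that surfaced informative feedback, enabling self-correcting code generation.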
Findings
Up to 6% absolute pass@1 gains over prior methods.
Largest gains on Medium/Hard subsets (+4.38/+4.20 at Iter-5).
Effective across multiple benchmarks and model families.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard recipe for post-training LLMs on reasoning tasks, with Group Relative Policy Optimization (GRPO) emerging as a leading approach. However, GRPO and its variants are inherently single-turn: they optimize from terminal rewards on isolated prompt-response pairs, leaving them poorly suited to agentic settings where models must iteratively refine solutions in response to environmental feedback. We introduce MURPHY, a multi-turn extension of GRPO for self-correcting code generation. MURPHY constructs feedback-conditioned rollout trees in which failed candidate solutions are paired with executor feedback and expanded into subsequent turns, and propagates rewards backward through the tree so that later successful refinements credit earlier attempts that surfaced informative feedback. We study two propagation strategies,…
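The rollout-tree idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Node` structure, the `propagate` rule (a discounted maximum over refinements, with a hypothetical discount `gamma`), and the reward values are placeholders, not the authors' implementation. The two propagation strategies the paper studies are not shown in this excerpt and are not reproduced here. The advantage computation follows the commonly used group-relative form (reward minus group mean, divided by group standard deviation).

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import List


@dataclass
class Node:
    """One attempt in a rollout tree (hypothetical structure)."""
    code: str
    reward: float                      # verifiable reward, e.g. 1.0 if all tests pass
    feedback: str = ""                 # executor feedback attached to a failed attempt
    children: List["Node"] = field(default_factory=list)  # next-turn refinements
    credited: float = 0.0              # reward after backward propagation


def propagate(node: Node, gamma: float = 0.9) -> float:
    """Assumed placeholder rule: credit a node with the larger of its own
    reward and the discounted best credited reward among its refinements."""
    best_child = max((propagate(c, gamma) for c in node.children), default=0.0)
    node.credited = max(node.reward, gamma * best_child)
    return node.credited


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: normalize each reward by the group's
    mean and (population) standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two first-turn attempts for the same prompt; both fail their tests.
first = Node(code="attempt A", reward=0.0, feedback="AssertionError on test 3")
second = Node(code="attempt B", reward=0.0, feedback="TypeError on test 1")

# Each failed attempt is expanded into second-turn refinements
# conditioned on its feedback; only one refinement of A succeeds.
first.children = [Node(code="A fix 1", reward=1.0), Node(code="A fix 2", reward=0.0)]
second.children = [Node(code="B fix 1", reward=0.0), Node(code="B fix 2", reward=0.0)]

# Propagate rewards backward, then compare first-turn attempts as a group.
for attempt in (first, second):
    propagate(attempt)
advantages = group_relative_advantages([first.credited, second.credited])
```

In this toy tree, both first-turn attempts have the same terminal reward of zero, so a single-turn objective could not tell them apart. After propagation, attempt A receives credit because one of its refinements succeeded, and it ends up with the higher group-relative advantage.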
