Process Supervision-Guided Policy Optimization for Code Generation
Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, Lin Yan

TL;DR
This paper introduces a Process Reward Model (PRM) that provides dense, line-level feedback during code generation, significantly improving reinforcement learning-based code synthesis, especially for complex, long-horizon tasks.
Contribution
The paper presents a novel PRM that offers dense feedback during code generation, enhancing RL training efficiency and effectiveness over sparse reward methods.
Findings
PRMs improve RL code generation performance
Using PRMs as dense rewards boosts learning efficiency
PRMs are especially effective for long-horizon tasks
Abstract
Reinforcement learning (RL) with unit test feedback has enhanced large language models' (LLMs) code generation, but relies on sparse rewards provided only after complete code evaluation, limiting learning efficiency and incremental improvements. When generated code fails all unit tests, no learning signal is received, hindering progress on complex tasks. To address this, we propose a Process Reward Model (PRM) that delivers dense, line-level feedback on code correctness during generation, mimicking human code refinement and providing immediate guidance. We explore various strategies for training PRMs and integrating them into the RL framework, finding that using PRMs both as dense rewards and for value function initialization significantly boosts performance. Our experimental results also highlight the effectiveness of PRMs in enhancing RL-driven code generation, especially for…
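To make the contrast between sparse and dense rewards concrete, here is a minimal sketch of how line-level PRM scores might be folded into a per-line reward sequence. It is not the authors' implementation: the function name `shaped_rewards`, the assumption that PRM scores lie in [-1, 1], the `weight` coefficient, and the division by the number of lines are all hypothetical choices made for illustration.

```python
from typing import List


def shaped_rewards(
    line_scores: List[float],
    passed_all_tests: bool,
    weight: float = 0.5,  # hypothetical shaping coefficient, not from the paper
) -> List[float]:
    """Combine a sparse unit-test reward with dense per-line PRM scores.

    line_scores: one PRM score per generated line (assumed to lie in [-1, 1]).
    passed_all_tests: outcome of running the full unit-test suite.
    """
    if not line_scores:
        return []
    n = len(line_scores)
    # Dense part: each line gets its PRM score, scaled by `weight` and
    # divided by the line count so long programs do not accumulate more
    # shaping reward than short ones (an assumed form of length normalization).
    rewards = [weight * score / n for score in line_scores]
    # Sparse part: the usual terminal reward, added on the final line only.
    rewards[-1] += 1.0 if passed_all_tests else 0.0
    return rewards


# A program that fails every test still yields a non-zero signal per line.
print(shaped_rewards([1.0, 1.0, -1.0, -1.0], passed_all_tests=False))
```

With the sparse reward alone, the failing program in the usage line would receive zero everywhere; with the dense term, the lines the PRM judges as good or bad are still distinguished.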
Peer Reviews
Decision · Submitted to ICLR 2026
The paper defines a concrete procedure to generate process labels by binary search with best of $K$ completions, which turns sparse unit test feedback into dense signals during generation. The algorithm and labeling rule are explicit and testable. The PRM is trained with a simple regression loss, which is stable and easy to reproduce. The integration points in RL are well chosen, with clear reward shaping weights and length normalization. The experiments include two base models and three benchma
The paper relies on in-house training data with about 30,000 coding problems inside a broader RLHF set. Details on licenses, contamination checks, and overlap with evaluation sets are not given, so it is hard to judge fairness and generalization. Compute cost for PRM data collection and RL is not quantified. The best of $K$ completion step for labeling is a heuristic, and the paper does not report sensitivity to $K$ or to the sampling temperature during labeling. Reward shaping uses fi
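The labeling procedure this review describes (binary search over prefixes, with best-of-$K$ completions deciding whether a prefix is still salvageable) could be sketched roughly as follows. This is a hedged reconstruction, not the paper's algorithm: the helper names `best_of_k_passes` and `label_lines`, the +1/-1 label values, and the assumption that prefix viability is monotone (once a prefix cannot be completed, no longer prefix can) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Lines = List[str]


def best_of_k_passes(
    prefix: Sequence[str],
    complete: Callable[[Sequence[str]], Lines],  # hypothetical sampler
    passes_tests: Callable[[Lines], bool],       # hypothetical test runner
    k: int = 20,
) -> bool:
    """Return True if any of k sampled completions of `prefix` passes the tests."""
    return any(passes_tests(complete(prefix)) for _ in range(k))


def label_lines(
    lines: Sequence[str],
    prefix_ok: Callable[[Sequence[str]], bool],
) -> List[int]:
    """Label each line +1 or -1 by binary search for the longest viable prefix.

    Assumes viability is monotone in prefix length and that the empty
    prefix is viable, so O(log n) oracle calls suffice instead of n.
    """
    lo, hi = 0, len(lines)  # invariant: the prefix of length `lo` is viable
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if prefix_ok(lines[:mid]):
            lo = mid
        else:
            hi = mid - 1
    return [1 if i < lo else -1 for i in range(len(lines))]


# Toy usage: pretend the first three lines are fine and the fourth is a bug.
program = ["a", "b", "c", "d", "e"]
print(label_lines(program, lambda prefix: len(prefix) <= 3))
```

Each call to `prefix_ok` would cost up to `k` model completions plus test runs in practice, which is the compute overhead the reviewers ask the authors to quantify.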
- The paper identifies a well-known issue, sparse reward signals in code generation RL, and provides a human-inspired solution via process-level feedback.
- The dual use of PRM as both a dense reward source and value function initializer is conceptually elegant and empirically validated. The binary search–based labeling algorithm is also efficient and interpretable.
- The experiments cover multiple datasets and models (Qwen2.5-7B, InHouse-Lite), with detailed ablations on reward shaping, dat
- The PRM data collection relies on K = 20 best-of-K completions per prefix and binary search per line, which makes the process computationally heavy. The paper does not quantify this overhead (e.g., GPU hours, wall-clock time, or scaling behavior). Without this information, the claimed “practical” pipeline seems questionable for larger-scale deployments.
- PRM labels depend entirely on unit-test coverage; when tests are incomplete, line-level feedback can be noisy or misleading. The authors
The paper presents a novel approach to fine-tune an LLM with RL for code generation.
I see two points here: 1) The authors train a PRM, which is compared to a baseline RL method where a reward of +1 is given to code that passes *all* the unit tests and 0 is given otherwise. This baseline can easily be improved by giving partial reward for the unit tests that pass. So in my opinion the current baseline is too weak and should be replaced. 2) To train a PRM the authors use partial trajectories completed by an oracle. I am not sure about this approach, since the shorter pro
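The stronger baseline suggested in point 1 can be illustrated with a small sketch comparing the all-or-nothing reward with a fractional one. The function names and the choice of "fraction of tests passed" as the partial reward are assumptions for illustration; the review does not specify a particular partial-credit scheme.

```python
from typing import List


def binary_reward(test_results: List[bool]) -> float:
    """All-or-nothing reward: 1.0 only if every unit test passes."""
    # An empty test list earns no reward (all([]) would otherwise be True).
    return 1.0 if test_results and all(test_results) else 0.0


def partial_reward(test_results: List[bool]) -> float:
    """Hypothetical partial-credit reward: the fraction of unit tests passed."""
    if not test_results:
        return 0.0
    return sum(test_results) / len(test_results)


results = [True, False, True, True]
print(binary_reward(results), partial_reward(results))  # 0.0 0.75
```

Under the binary scheme, a program that passes three of four tests is indistinguishable from one that passes none, which is the weakness the reviewer points to.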
Taxonomy
Topics: Business Process Modeling and Analysis · Model-Driven Software Engineering Techniques
