VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation
Longwen Wang, Xuan'er Wu, Xiaohui Hu, Yirui Liu, Yuankai Fan, Kaidong Yu, Qizhen Weng, Wei Xi, Xuelong Li

TL;DR
VeRPO is a reinforcement learning framework for code generation that builds dense, verifiable rewards by dynamically weighting partial success on unit tests, making training more efficient and reliable.
Contribution
The paper presents VeRPO, a novel RL method that constructs dense, verifiable rewards from execution feedback, addressing reward sparsity and misalignment issues in code generation.
Findings
Outperforms baselines, with gains of up to +8.83% in pass@1.
Achieves this with negligible time and memory overhead.
Demonstrates robustness across diverse benchmarks.
Abstract
Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream pass/fail outcome rewards enforce functional correctness via executing unit tests, but the resulting sparsity limits potential performance gains. While recent work has explored external Reward Models (RM) to generate richer, continuous rewards, the learned RMs suffer from reward misalignment and prohibitive computational cost. In this paper, we introduce VeRPO (Verifiable Dense Reward Policy Optimization), a novel RL framework for code generation that synthesizes robust and dense rewards fully grounded in verifiable execution feedback. The core idea of VeRPO is constructing dense rewards from weighted partial success: by dynamically estimating the difficulty weight of each unit test based on the execution statistics…
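The abstract is cut off before it gives the weighting rule, so the following is only an illustrative sketch of "weighted partial success", not the paper's formula. It assumes that a test's difficulty is one minus its pass rate across a group of sampled solutions, plus a small floor `eps` so that easy tests still count. The names `difficulty_weights`, `dense_rewards`, and `eps` are hypothetical.

```python
from typing import List, Sequence


def difficulty_weights(results: Sequence[Sequence[bool]], eps: float = 0.1) -> List[float]:
    """Estimate a normalized weight per unit test from execution statistics.

    results[i][j] is True if sampled solution i passed unit test j.
    ASSUMPTION: difficulty = (1 - pass rate) + eps. This rule is a guess
    made for illustration; the source text does not state the actual one.
    """
    n_samples = len(results)
    n_tests = len(results[0])
    pass_rates = [sum(row[j] for row in results) / n_samples for j in range(n_tests)]
    raw = [(1.0 - p) + eps for p in pass_rates]
    total = sum(raw)
    return [w / total for w in raw]  # weights sum to 1


def dense_rewards(results: Sequence[Sequence[bool]], eps: float = 0.1) -> List[float]:
    """Reward each sample with the total weight of the tests it passed."""
    weights = difficulty_weights(results, eps)
    return [sum(w for w, ok in zip(weights, row) if ok) for row in results]


# Four sampled solutions evaluated on three unit tests.
group = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [False, False, False],
]
weights = difficulty_weights(group)
rewards = dense_rewards(group)
```

Under this assumed rule, a solution that passes every test receives a reward of 1, one that passes none receives 0, and partial solutions fall in between, with rarely passed tests contributing more. A binary pass/fail reward would instead assign 0 to the first, second, and fourth samples alike.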
Taxonomy
Topics: Reinforcement Learning in Robotics · Software Engineering Research · Domain Adaptation and Few-Shot Learning
