VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation

Longwen Wang; Xuan'er Wu; Xiaohui Hu; Yirui Liu; Yuankai Fan; Kaidong Yu; Qizhen Weng; Wei Xi; Xuelong Li

arXiv:2601.03525 · cs.LG · January 12, 2026

TL;DR

VeRPO introduces a verifiable, dense reward framework for code generation in reinforcement learning, improving performance by dynamically weighting partial success signals from unit tests, leading to more efficient and reliable training.

Contribution

The paper presents VeRPO, a novel RL method that constructs dense, verifiable rewards from execution feedback, addressing reward sparsity and misalignment issues in code generation.

Findings

01 Outperforms baselines with up to +8.83% pass@1 improvement.

02 Achieves this with negligible time and memory overhead.

03 Demonstrates robustness across diverse benchmarks.

Abstract

Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream pass/fail outcome rewards enforce functional correctness via executing unit tests, but the resulting sparsity limits potential performance gains. While recent work has explored external Reward Models (RM) to generate richer, continuous rewards, the learned RMs suffer from reward misalignment and prohibitive computational cost. In this paper, we introduce VeRPO (Verifiable Dense Reward Policy Optimization), a novel RL framework for code generation that synthesizes robust and dense rewards fully grounded in verifiable execution feedback. The core idea of VeRPO is constructing dense rewards from weighted partial success: by dynamically estimating the difficulty weight of each unit test based on the execution statistics…
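
To make the idea of weighted partial success concrete, here is a minimal sketch in Python. The abstract is truncated before it gives the actual weighting rule, so everything specific here is an assumption rather than the paper's method: the function names `difficulty_weights` and `dense_rewards` are hypothetical, and the rule "weight = 1 minus the empirical pass rate across a group of sampled solutions, then normalized" is only one plausible way to turn execution statistics into per-test difficulty weights.

```python
from typing import List


def difficulty_weights(results: List[List[bool]]) -> List[float]:
    """Estimate one weight per unit test from group execution statistics.

    results[i][j] is True if sampled solution i passed unit test j.
    ASSUMPTION: weight is proportional to (1 - pass rate); this rule is
    illustrative and is not taken from the paper.
    """
    n_samples = len(results)
    n_tests = len(results[0])
    # Empirical pass rate of each test across the sampled solutions.
    pass_rates = [
        sum(row[j] for row in results) / n_samples for j in range(n_tests)
    ]
    # Tests that fewer samples pass are treated as harder.
    raw = [1.0 - p for p in pass_rates]
    total = sum(raw)
    if total == 0.0:
        # Every sample passed every test: fall back to uniform weights.
        return [1.0 / n_tests] * n_tests
    # Normalize so that passing all tests yields a reward of 1.
    return [w / total for w in raw]


def dense_rewards(results: List[List[bool]]) -> List[float]:
    """Reward each sample with the summed weights of the tests it passed."""
    weights = difficulty_weights(results)
    return [
        sum(w for w, passed in zip(weights, row) if passed) for row in results
    ]


if __name__ == "__main__":
    # Four sampled solutions evaluated on three unit tests (made-up data).
    group = [
        [True, True, False],
        [True, False, False],
        [True, True, True],
        [False, False, False],
    ]
    print(dense_rewards(group))
```

Under this assumed rule, a solution that passes only some tests still receives a nonzero reward, and passing a rarely passed test contributes more than passing a commonly passed one. A binary pass/fail reward would instead give 0 to every sample in the example except the third.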

Taxonomy

Topics: Reinforcement Learning in Robotics · Software Engineering Research · Domain Adaptation and Few-Shot Learning