TL;DR
This paper introduces Verifiable Process Rewards (VPR), a framework that turns verifiable intermediate signals into dense, turn-level rewards for reinforcement learning on agentic reasoning tasks. The authors report gains over outcome-level reward baselines and transfer to general and agentic reasoning benchmarks.
Contribution
VPR converts verifiable intermediate actions into dense supervision signals, enhancing credit assignment and reasoning capabilities in large language models.
Findings
VPR outperforms outcome-level reward baselines in controlled environments.
VPR transfers effectively to general and agentic reasoning benchmarks.
Dense verifier-grounded rewards improve long-horizon credit assignment.
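The credit assignment finding can be illustrated with a toy comparison of the two reward shapes. This is a minimal sketch, not the paper's implementation: the function names, the +1 / -1 reward values, and the four-turn trajectory are all assumptions chosen for illustration.

```python
from typing import List


def outcome_rewards(num_turns: int, success: bool) -> List[float]:
    """Sparse outcome-level signal: zero everywhere except the final turn."""
    rewards = [0.0] * num_turns
    if num_turns > 0:
        rewards[-1] = 1.0 if success else 0.0
    return rewards


def process_rewards(step_ok: List[bool]) -> List[float]:
    """Dense turn-level signal: each turn is scored by its own verifier check.

    The +1 / -1 values are an assumption for illustration only.
    """
    return [1.0 if ok else -1.0 for ok in step_ok]


# Hypothetical failed trajectory in which only the third turn is wrong.
step_ok = [True, True, False, True]
sparse = outcome_rewards(len(step_ok), success=False)
dense = process_rewards(step_ok)

print(sparse)  # [0.0, 0.0, 0.0, 0.0]
print(dense)   # [1.0, 1.0, -1.0, 1.0]
```

Under the sparse signal, the three correct turns are indistinguishable from the flawed one. The dense signal singles out the third turn, which is the kind of per-turn information the findings attribute to verifier-grounded rewards.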
Abstract
Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existing approaches rely on sparse outcome-level feedback. This sparsity creates a credit assignment challenge in long-horizon agentic reasoning: a trajectory may fail despite containing many correct intermediate decisions, or succeed despite containing flawed ones. In this work, we study a class of densely-verifiable agentic reasoning problems, where intermediate actions can be objectively checked by symbolic or algorithmic oracles. We propose Verifiable Process Rewards (VPR), a framework that converts such oracles into dense turn-level supervision for reinforcement learning, and instantiate it in three representative settings: search-based verification for dynamic deduction, constraint-based verification for logical reasoning, and posterior-based…
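As a rough illustration of how an algorithmic oracle could be turned into turn-level supervision, the sketch below scores each action in a toy grid puzzle with a constraint checker. The puzzle, the `violates_constraints` oracle, the reward values, and the choice to leave the state unchanged after a rejected action are all hypothetical; the abstract does not specify the actual environments or reward design.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]     # (row, column)
Action = Tuple[Cell, int]  # place a value at a cell


def violates_constraints(grid: Dict[Cell, int], cell: Cell, value: int) -> bool:
    """Toy oracle (assumed): a filled cell may not be overwritten, and a
    value may not repeat within a row or a column."""
    if cell in grid:
        return True
    row, col = cell
    return any(
        v == value and (r == row or c == col)
        for (r, c), v in grid.items()
    )


def turn_level_rewards(actions: List[Action]) -> List[float]:
    """Score every action with the oracle instead of waiting for the outcome."""
    grid: Dict[Cell, int] = {}
    rewards: List[float] = []
    for cell, value in actions:
        if violates_constraints(grid, cell, value):
            rewards.append(-1.0)  # rejected action; the grid is left unchanged
        else:
            grid[cell] = value
            rewards.append(1.0)
    return rewards


# Hypothetical trajectory: the third action repeats a 1 in column 0.
actions = [((0, 0), 1), ((0, 1), 2), ((1, 0), 1), ((1, 1), 1)]
print(turn_level_rewards(actions))  # [1.0, 1.0, -1.0, 1.0]
```

The same pattern would apply to any checker that can judge a single action against the current state. Only the oracle changes between settings.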
