Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Amrith Setlur; Chirag Nagpal; Adam Fisch; Xinyang Geng; Jacob Eisenstein; Rishabh Agarwal; Alekh Agarwal; Jonathan Berant; Aviral Kumar

arXiv:2410.08146·cs.LG·October 11, 2024·3 cites

Open Access

TL;DR

This paper proposes designing process reward models (PRMs) for large language models so that step-level feedback measures progress toward a correct answer. The resulting process advantage verifiers (PAVs) improve reasoning accuracy, exploration, and efficiency compared to outcome reward models (ORMs).

Contribution

It proposes a method for designing process rewards that measure progress under a prover policy distinct from the base policy, supported by a theoretical characterization and empirical validation.

Findings

01. Test-time search with PAVs is >8% more accurate than with ORMs.

02. Online RL with PAVs yields 5-6x sample-efficiency gains over ORMs.

03. PAVs are 1.5-5x more compute-efficient than ORMs.

Abstract

A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?". Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step-level advantages in RL. Crucially, this…
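The notion of progress described in the abstract can be illustrated with a small sketch. It estimates, by Monte Carlo rollouts, how much a candidate step changes the probability that a prover policy eventually reaches a correct answer. All names here (`success_rate`, `step_progress`, `toy_prover_rollout`) and the toy success probabilities are hypothetical, chosen only for illustration. The sketch also assumes direct rollout sampling, which is not necessarily how the paper obtains its verifiers.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical interface: a rollout takes a prefix of reasoning steps and an RNG,
# lets the prover policy finish the solution, and reports whether it was correct.
Rollout = Callable[[List[str], random.Random], bool]


def success_rate(prefix: List[str], rollout: Rollout,
                 n_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of P(correct final answer | prefix) under the prover."""
    hits = sum(1 for _ in range(n_samples) if rollout(prefix, rng))
    return hits / n_samples


def step_progress(prefix: Sequence[str], step: str, rollout: Rollout,
                  n_samples: int = 2000, seed: int = 0) -> float:
    """Progress of `step`: success likelihood after the step minus before it.

    This difference plays the role of a step-level advantage.
    """
    rng = random.Random(seed)
    before = success_rate(list(prefix), rollout, n_samples, rng)
    after = success_rate(list(prefix) + [step], rollout, n_samples, rng)
    return after - before


def toy_prover_rollout(prefix: List[str], rng: random.Random) -> bool:
    """Toy prover (an assumption): each 'good' step raises and each 'bad' step
    lowers the chance of finishing correctly."""
    good = sum(1 for s in prefix if s == "good")
    bad = sum(1 for s in prefix if s == "bad")
    p = min(1.0, max(0.0, 0.3 + 0.2 * good - 0.2 * bad))
    return rng.random() < p


if __name__ == "__main__":
    print("good step:", step_progress([], "good", toy_prover_rollout))
    print("bad step: ", step_progress([], "bad", toy_prover_rollout))
```

Under this toy prover, a helpful step receives a positive score and a harmful step a negative one, even though neither step is the final answer. An outcome-only reward cannot distinguish the two until the trace ends.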

Taxonomy

Topics: Business Process Modeling and Analysis · Semantic Web and Ontologies · Service-Oriented Architecture and Web Services

Methods: Sparse Evolutionary Training · Balanced Selection