Generative Verifiers: Reward Modeling as Next-Token Prediction

Lunjun Zhang; Arian Hosseini; Hritik Bansal; Mehran Kazemi; Aviral Kumar; Rishabh Agarwal

arXiv:2408.15240·cs.LG·February 25, 2025·2 cites

TL;DR

This paper introduces generative verifiers (GenRM), trained via next-token prediction, which outperform traditional discriminative verifiers in guiding large language models for reasoning tasks, with improved accuracy and scalability.

Contribution

Proposes a novel generative verifier training method using next-token prediction, enhancing LLM-based verification and reasoning capabilities over existing discriminative approaches.

Findings

01. GenRM outperforms discriminative verifiers on multiple benchmarks.

02. Training with synthetic rationales improves error detection in math problems.

03. GenRM scales well with model size and test-time compute.

Abstract

Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs). A common approach is the Best-of-N method, where N candidate solutions generated by the LLM are ranked by a verifier, and the best one is selected. While LLM-based verifiers are typically trained as discriminative classifiers to score solutions, they do not utilize the text generation capabilities of pretrained LLMs. To overcome this limitation, we instead propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs: they integrate seamlessly with instruction tuning, enable chain-of-thought reasoning, and can utilize additional test-time compute via majority voting for better verification. We…
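The Best-of-N procedure and the use of extra test-time compute described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the verifier exposes a probability that a solution is correct (written here as P("Yes")), and it reads "majority voting" as averaging that probability over several sampled verification rationales. The names `best_of_n`, `voted_score`, and the toy numbers are all hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate solution with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def voted_score(yes_probs: Sequence[float]) -> float:
    """Average P("Yes") over several sampled verification rationales.

    Assumption: each entry is the verifier's probability that the solution
    is correct, obtained from one independently sampled rationale.
    """
    if not yes_probs:
        raise ValueError("need at least one sample")
    return mean(yes_probs)


# Hypothetical data: for each candidate, P("Yes") from K = 3 sampled rationales.
toy_yes_probs = {
    "answer: 18": [0.9, 0.7, 0.8],
    "answer: 20": [0.95, 0.2, 0.3],
    "answer: 16": [0.1, 0.2, 0.15],
}


def single_sample_score(solution: str) -> float:
    """Score using only the first sampled rationale (no voting)."""
    return toy_yes_probs[solution][0]


def averaged_score(solution: str) -> float:
    """Score using the average over all K sampled rationales."""
    return voted_score(toy_yes_probs[solution])


candidates = list(toy_yes_probs)
print(best_of_n(candidates, single_sample_score))
print(best_of_n(candidates, averaged_score))
```

In this toy data, a single overconfident rationale favors one candidate, while averaging over three rationales favors another. That is the kind of effect extra verification compute is meant to provide, though the numbers here are invented purely for illustration.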

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. Innovative Approach: Reframing verification as a generative task, specifically with GenRM-CoT, is novel and shows promise for complex reasoning tasks.
2. Synthetic Rationale Generation: The use of the same model to generate both solutions and synthetic rationales offers a more streamlined and potentially scalable verification process.
3. Improved Performance: Results indicate that GenRM-CoT improves upon discriminative reward models, especially when using chain-of-thought (CoT)/ScratchPad [1]

Weaknesses

Scientific Reservations
1. Limited Mathematical Task Scope: The reliance on GSM8K and limited algorithmic tasks raises concerns about generalizability. These datasets represent only basic levels of math reasoning (grade school and high school). Including results from more rigorous benchmarks, such as the IMO portion of OlympiadBench [2] or math subsets of MMLU, would strengthen the claims.
2. Over-Reliance on Proprietary Model (Gemini 1.0 Pro): By using Gemini 1.0 Pro to generate solutions and

Reviewer 02 · Rating 8 · Confidence 3

Strengths

The authors examine a very relevant and important problem of learning good verifiers for LLM generations. They present a new method that is well-motivated (increasing verification compute and framing verification to match the original LLM objective). The authors conduct a lot of key experiments exploring these dimensions, and the paper is explained clearly and is very easy to understand.
- In particular, it’s great to see experiments measuring the generation performance change as well as experime

Weaknesses

In many of the plots (Figures 1, 4, 5, 6), the y-axis scaling changes from plot to plot and is often very restricted (i.e., sometimes spanning only 4%). This is misleading when comparing results, and it would be great to standardize it more. GenRM does improve over the baselines (seemingly more on harder tasks, which is worth highlighting more!), but the improvement is often relatively small (e.g., 1% on GSM8K over discriminative). "In Figure 8, we show that generative verifiers, especi

Reviewer 03 · Rating 3 · Confidence 4

Strengths

An easy to understand method that works in the few experiments of the paper.

Weaknesses

I have several concerns about this paper. First, I don't think such a process can be called a "verifier", as there is no rigor in the entire process. In particular, we do not have any guarantee on the final probability value. It relies completely on the quality of the other LLM to evaluate the solution, and, as mentioned at the beginning of the paper, LLMs "... often confidently make logical and factual mistakes". I understand this is what the community is doing, but on the other hand, this paper does

Code & Models

Datasets

flowingpurplecrane/genrm
dataset· 31 dl

Videos

Generative Verifiers: Reward Modeling as Next-Token Prediction · slideslive

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Topic Modeling

Methods: Direct Preference Optimization