Generative Verifiers: Reward Modeling as Next-Token Prediction
Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal

TL;DR
This paper introduces generative verifiers (GenRM), trained via next-token prediction, which outperform traditional discriminative verifiers in guiding large language models for reasoning tasks, with improved accuracy and scalability.
Contribution
Proposes a novel generative verifier training method using next-token prediction, enhancing LLM-based verification and reasoning capabilities over existing discriminative approaches.
Findings
GenRM outperforms discriminative verifiers on multiple benchmarks.
Training with synthetic rationales improves error detection in math problems.
GenRM scales well with model size and test-time compute.
Abstract
Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs). A common approach is the Best-of-N method, where N candidate solutions generated by the LLM are ranked by a verifier, and the best one is selected. While LLM-based verifiers are typically trained as discriminative classifiers to score solutions, they do not utilize the text generation capabilities of pretrained LLMs. To overcome this limitation, we instead propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs: they integrate seamlessly with instruction tuning, enable chain-of-thought reasoning, and can utilize additional test-time compute via majority voting for better verification. We…
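The Best-of-N procedure and the use of extra test-time compute described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy probability table, and the choice to aggregate by averaging the verifier's correctness probability over several sampled verification rationales are all assumptions made for illustration. In a real system, the hypothetical sample_verdict_prob callable would prompt an LLM verifier, sample a chain-of-thought rationale, and read off the probability of a "Yes" verdict.

```python
import itertools
from typing import Callable, Dict, List, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate solution with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=score)


def voted_score(
    solution: str,
    sample_verdict_prob: Callable[[str], float],
    num_votes: int,
) -> float:
    """Average the correctness probability over num_votes sampled rationales.

    Assumption: averaging stands in for the "majority voting" mentioned in
    the abstract. Each call to sample_verdict_prob represents one sampled
    verification rationale ending in a probability that the solution is correct.
    """
    if num_votes < 1:
        raise ValueError("num_votes must be at least 1")
    probs = [sample_verdict_prob(solution) for _ in range(num_votes)]
    return sum(probs) / len(probs)


def make_toy_verifier(table: Dict[str, List[float]]) -> Callable[[str], float]:
    """Hypothetical stand-in for an LLM verifier.

    Replays fixed probabilities per solution so the example is deterministic.
    """
    streams = {sol: itertools.cycle(probs) for sol, probs in table.items()}

    def sample_verdict_prob(solution: str) -> float:
        return next(streams[solution])

    return sample_verdict_prob


# Toy data (made up): the wrong answer "x = 5" gets one overconfident first
# rationale, which further sampled rationales contradict.
TOY_TABLE = {
    "x = 4": [0.9, 0.7, 0.8],
    "x = 5": [0.95, 0.1, 0.15],
}
CANDIDATES = list(TOY_TABLE)

single_sampler = make_toy_verifier(TOY_TABLE)
pick_single = best_of_n(CANDIDATES, lambda s: voted_score(s, single_sampler, 1))

voted_sampler = make_toy_verifier(TOY_TABLE)
pick_voted = best_of_n(CANDIDATES, lambda s: voted_score(s, voted_sampler, 3))

print(pick_single)  # selected using one rationale per candidate
print(pick_voted)   # selected using three rationales per candidate
```

In this toy setup a single rationale per candidate selects the overconfident wrong answer, while averaging three rationales selects the other one. That is the intuition behind spending more test-time compute on verification, though the numbers here are invented and say nothing about the paper's measured gains.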
Peer Reviews
Decision: ICLR 2025 Poster
1. Innovative Approach: Reframing verification as a generative task, specifically with GenRM-CoT, is novel and shows promise for complex reasoning tasks.
2. Synthetic Rationale Generation: The use of the same model to generate both solutions and synthetic rationales offers a more streamlined and potentially scalable verification process.
3. Improved Performance: Results indicate that GenRM-CoT improves upon discriminative reward models, especially when using chain-of-thought (CoT)/ScratchPad [1]
Scientific Reservations
1. Limited Mathematical Task Scope: The reliance on GSM8K and limited algorithmic tasks raises concerns about generalizability. These datasets represent only basic levels of math reasoning (grade school and high school). Including results from more rigorous benchmarks, such as the IMO portion of OlympiadBench [2] or math subsets of MMLU, would strengthen the claims.
2. Over-Reliance on Proprietary Model (Gemini 1.0 Pro): By using Gemini 1.0 Pro to generate solutions and
The authors examine a very relevant and important problem: learning good verifiers for LLM generations. They present a new method that is well motivated (increasing verification compute and framing verification to match the original LLM objective). The authors conduct many key experiments exploring these dimensions, and the paper is clearly explained and very easy to understand. - In particular, it’s great to see experiments measuring the generation performance change as well as experime
In many of the plots (Figures 1, 4, 5, 6), the y-axis scaling changes from plot to plot and is often very restricted (i.e., sometimes spanning only 4%). This is misleading when comparing results, and it would be great to standardize it more. GenRM does improve over the baselines (seemingly more on harder tasks, which is worth highlighting more!), but often the improvement is relatively small (e.g., 1% for GSM8K over discriminative). "In Figure 8, we show that generative verifiers, especi
An easy to understand method that works in the few experiments of the paper.
I have several concerns about this paper. First, I don't think such a process can be called a "verifier", as there is no rigor in the entire process. In particular, we do not have any guarantee on the final probability value. It relies completely on the quality of the other LLM to evaluate the solution, and as mentioned at the beginning of the paper, LLMs "... often confidently make logical and factual mistakes". I understand this is what the community is doing, but on the other hand, this paper does
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Topic Modeling
Methods: Direct Preference Optimization
