Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Kusha Sareen; Morgane M Moss; Alessandro Sordoni; Rishabh Agarwal; Arian Hosseini

arXiv:2505.04842·cs.LG·April 14, 2026


TL;DR

This paper introduces RL$^V$, which augments "value-free" RL fine-tuning of LLM reasoners by jointly training the model as a reasoner and a generative verifier, improving both accuracy and test-time compute scaling.

Contribution

RL$^V$ unifies reasoner and verifier training in a single LLM, using RL-generated data to add verification capabilities and enable efficient test-time compute scaling without significant overhead.
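One way to picture this unification is as a single training objective that adds a verification term to the base RL term. The form below is a sketch under stated assumptions, not a formula taken from this page: the weighted-sum structure, the coefficient $\lambda$, and the symbols $x$ (problem), $y$ (sampled solution), and $z$ (a correct/incorrect label for $y$) are all illustrative.

$$
\mathcal{J}_{\text{unified}}(\theta) = \mathcal{J}_{\text{RL}}(\theta) + \lambda \, \mathcal{J}_{\text{verify}}(\theta), \qquad \mathcal{J}_{\text{verify}}(\theta) = \mathbb{E}_{(x, y, z)} \left[ \log p_\theta(z \mid x, y) \right]
$$

Here $\mathcal{J}_{\text{RL}}$ stands for the objective of whichever "value-free" method is being augmented, and the triples $(x, y, z)$ are assumed to come from the solutions already sampled during RL, which is why the verification term would add little extra data-generation cost.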

Findings

01

RL$^V$ increases MATH accuracy by over 20% with parallel sampling.

02

RL$^V$ enables 8-32× more efficient test-time compute scaling than the base RL method.

03

RL$^V$ achieves 1.2-1.6× higher performance when parallel and sequential scaling are combined.

Abstract

Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. Yet if parallel test-time compute is already part of the deployment plan, training should be designed to support it. In this work, we propose RL$^V$, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL$^V$ boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× more efficient test-time compute scaling compared to the base RL method. RL$^V$ also exhibits strong generalization capabilities for both easy-to-hard and…
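To make the role of verification in parallel sampling concrete, here is a minimal Python sketch of how verifier scores could be used to pick a final answer from several sampled solutions. It is an illustration under assumptions rather than the paper's procedure: the function names `weighted_vote` and `majority_vote`, the `(answer, score)` input format, and the rule of summing scores per answer are all hypothetical, and the scores are hard-coded stand-ins for what a verifier might output.

```python
from collections import Counter, defaultdict
from typing import List, Tuple

# Each sample is (final_answer, verifier_score). The score is assumed to be
# the verifier's estimate in [0, 1] that the sampled solution is correct.
Sample = Tuple[str, float]


def majority_vote(samples: List[Sample]) -> str:
    """Baseline without a verifier: return the most frequent final answer."""
    if not samples:
        raise ValueError("need at least one sampled solution")
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]


def weighted_vote(samples: List[Sample]) -> str:
    """Return the final answer with the largest summed verifier score."""
    if not samples:
        raise ValueError("need at least one sampled solution")
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Hypothetical scores: two low-confidence samples agree on "7",
    # one high-confidence sample says "9".
    samples = [("7", 0.2), ("7", 0.1), ("9", 0.95)]
    print(majority_vote(samples))  # 7
    print(weighted_vote(samples))  # 9
```

The example shows why a verifier can change the outcome of parallel sampling: the unweighted vote follows the more frequent answer, while the score-weighted vote follows the answer the verifier trusts most.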
