Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
Kusha Sareen, Morgane M Moss, Alessandro Sordoni, Rishabh Agarwal, Arian Hosseini

TL;DR
This paper introduces RL$^V$, a method that enhances RL fine-tuning of LLM reasoners by enabling joint training for reasoning and verification, significantly improving test-time scaling and accuracy.
Contribution
RL$^V$ is a novel approach that unifies reasoning and verification training, boosting performance and enabling efficient test-time compute scaling without major overhead.
Findings
RL$^V$ increases MATH accuracy by over 20% with parallel sampling.
RL$^V$ enables 8-32× more efficient test-time compute scaling than the base RL method.
RL$^V$ achieves 1.2-1.6× higher performance when parallel and sequential test-time compute are scaled together.
Abstract
Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. Yet if parallel test-time compute is already part of the deployment plan, training should be designed to support it. In this work, we propose RL$^V$, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL$^V$ boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× more efficient test-time compute scaling than the base RL method. RL$^V$ also exhibits strong generalization capabilities for both easy-to-hard and…
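To make the role of verification in parallel test-time compute concrete: several solutions are sampled for one problem, each is scored by a verifier, and the final answer is chosen using those scores. The Python sketch below is illustrative only and is not the authors' implementation. The function names, the assumption that each solution receives a verifier score in [0, 1] (for example, the probability the model assigns to "Yes" when asked whether a solution is correct), and the example numbers are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Verifier-free baseline: return the most frequent final answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Return the final answer of the single highest-scored solution."""
    if not answers or len(answers) != len(scores):
        raise ValueError("answers and scores must be non-empty and equal length")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return answers[best_index]


def weighted_vote(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Sum the verifier scores per distinct answer and return the top one."""
    if not answers or len(answers) != len(scores):
        raise ValueError("answers and scores must be non-empty and equal length")
    totals: defaultdict[str, float] = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=lambda a: totals[a])


if __name__ == "__main__":
    # Hypothetical final answers from three sampled solutions, with
    # hypothetical verifier scores (assumed to lie in [0, 1]).
    sampled_answers = ["42", "42", "17"]
    verifier_scores = [0.2, 0.3, 0.9]

    print(majority_vote(sampled_answers))                   # frequency only
    print(best_of_n(sampled_answers, verifier_scores))      # single best score
    print(weighted_vote(sampled_answers, verifier_scores))  # score mass per answer
```

In this toy case the verifier overrides the majority because the minority answer carries more total score. With a value-free method there are no such scores to use unless a separate verifier is trained, which is the gap the abstract describes.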
