Scaling Test-Time Compute Without Verification or RL is Suboptimal
Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar

TL;DR
This paper shows that finetuning large language models with verifier-based methods significantly outperforms verifier-free methods as test-time compute and training data scale, especially when the base pre-trained model's solution distribution is heterogeneous.
Contribution
The paper proves the superiority of verifier-based finetuning over verifier-free approaches in scaling test-time compute for large language models.
Findings
Verifier-based methods outperform verifier-free methods as compute scales.
The performance gap widens as test-time budgets and training data grow.
The results are validated empirically on multiple reasoning tasks across several model sizes.
Abstract
Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: first, distilling successful search or thinking traces; and second, using verification (e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution…
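To make the distinction between the two approaches concrete, here is a minimal toy sketch in Python. It is not the paper's method or analysis: it only contrasts taking a single sample with no verification against best-of-n selection guided by a perfect 0/1 outcome verifier at inference time, and it does not model finetuning. The names `sample_answer`, `verifier_free`, `best_of_n`, and `success_rate`, and the per-sample success probability `p_correct`, are all hypothetical and chosen for illustration.

```python
import random


def sample_answer(rng, p_correct):
    """Hypothetical base policy: True when the sampled answer is correct."""
    return rng.random() < p_correct


def verifier_free(rng, p_correct):
    # One sample, accepted as-is: no outcome signal is consulted.
    return sample_answer(rng, p_correct)


def best_of_n(rng, p_correct, n):
    # Draw up to n candidates and stop at the first one that an
    # (assumed perfect) 0/1 outcome verifier accepts.
    return any(sample_answer(rng, p_correct) for _ in range(n))


def success_rate(method, trials=10_000, seed=0, **kwargs):
    # Monte Carlo estimate of how often `method` returns a correct answer.
    rng = random.Random(seed)
    return sum(method(rng, **kwargs) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("verifier-free:", success_rate(verifier_free, p_correct=0.2))
    for n in (2, 4, 8):
        print(f"best-of-{n}:", success_rate(best_of_n, p_correct=0.2, n=n))
```

Under these assumptions the verified success rate approaches 1 - (1 - p_correct)^n as n grows, while the unverified single sample stays near p_correct. Real verifiers are imperfect and real budgets are measured in tokens, so this is an intuition aid only.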
Taxonomy
Topics: Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques · Software Testing and Debugging Techniques
Methods: Balanced Selection
