When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers
Jack Lu, Ryan Teehan, Jinran Jin, Mengye Ren

TL;DR
This study systematically evaluates when and how verification improves large language model solutions across diverse models, tasks, and training methods, and introduces a new metric, verifier gain, to predict when verification will help.
Contribution
It provides a comprehensive analysis of verification effectiveness across multiple model families, sizes, and training variants, and introduces verifier gain as a predictive metric.
Findings
Cross-family verification, where solver and verifier come from different model families, is more effective than verification within the same family.
The benefits of verification decrease as the solver and verifier become more similar.
Reasoning post-training strengthens the improvement obtained from cross-family verification.
Abstract
Large language models (LLMs) can act as both problem solvers and solution verifiers, where the latter select high-quality answers from a pool of solver-generated candidates. This raises the question of under what conditions verification pays off in solver-verifier systems. Prior work has conducted only limited studies of the factors influencing verification performance, focusing primarily on self-verification and examining neither the relationship between solver and verifier model families nor the effects of reasoning post-training. To rectify this, we present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. In order to support our analysis, we introduce and empirically…
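The solver-verifier setup described in the abstract, where a verifier picks an answer from a pool of solver-generated candidates, can be illustrated with a minimal sketch. This is not the paper's implementation: `select_with_verifier` and `toy_verifier` are hypothetical names, and the verifier is modelled as a plain scoring function rather than an LLM call.

```python
from typing import Callable, Sequence


def select_with_verifier(
    candidates: Sequence[str],
    verifier_score: Callable[[str], float],
) -> str:
    """Return the candidate the verifier scores highest.

    Assumption: the verifier is any callable mapping an answer to a
    number, higher meaning "more likely correct". On ties, the earliest
    candidate in the pool is kept.
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=verifier_score)


def toy_verifier(answer: str) -> float:
    # Hypothetical stand-in for an LLM verifier on the question "2 + 2 = ?".
    return 1.0 if answer.strip() == "4" else 0.0


# A pool of solver-generated candidates (made up for illustration).
pool = ["5", "4", "22"]
chosen = select_with_verifier(pool, toy_verifier)
```

In a real system the scoring function would wrap a call to a verifier model, and the solver and verifier could be the same model (self-verification) or models from different families.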
