Shrinking the Generation-Verification Gap with Weak Verifiers

Jon Saad-Falcon; E. Kelly Buchanan; Mayee F. Chen; Tzu-Heng Huang; Brendan McLaughlin; Tanvir Bhathal; Shang Zhu; Ben Athiwaratkun; Frederic Sala; Scott Linderman; Azalia Mirhoseini; Christopher Ré

arXiv:2506.18203 · cs.CL · December 10, 2025

TL;DR

Weaver is a framework that combines multiple weak verifiers into a strong, accurate verifier using weighted ensembles and weak supervision, significantly closing the gap with oracle verifiers in language model evaluation.

Contribution

We introduce Weaver, a novel method for combining weak, imperfect verifiers into a strong verifier using weighted ensembles and weak supervision techniques.

Findings

01. Weaver achieves 87.7% accuracy in test-time response selection.

02. Weaver significantly outperforms unweighted verifier combinations.

03. A small cross-encoder trained on Weaver's scores reduces computational costs.

Abstract

Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines outputs into a unified score…
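The difference between weighted and unweighted verifier combinations can be illustrated with a minimal sketch. It is not Weaver's implementation: the verifier accuracies below are hypothetical and supplied by hand, whereas the abstract says Weaver estimates them with weak supervision, and the log-odds weighting is a standard naive-Bayes choice that assumes conditionally independent binary verifiers. The function names (`log_odds_weight`, `weighted_score`, `select_response`) are made up for illustration.

```python
import math


def log_odds_weight(accuracy):
    """Naive-Bayes weight for a binary verifier with the given accuracy.

    Assumption: verifiers are conditionally independent given the true label.
    """
    return math.log(accuracy / (1.0 - accuracy))


def weighted_score(votes, accuracies):
    """Combine +1/-1 verifier votes into one score using log-odds weights."""
    return sum(log_odds_weight(a) * v for v, a in zip(votes, accuracies))


def unweighted_score(votes):
    """Plain average of the +1/-1 votes (every verifier counts equally)."""
    return sum(votes) / len(votes)


def select_response(candidate_votes, accuracies=None):
    """Return the index of the candidate response with the highest score.

    With accuracies=None the unweighted average is used; otherwise the
    weighted combination is used.
    """
    if accuracies is None:
        scores = [unweighted_score(v) for v in candidate_votes]
    else:
        scores = [weighted_score(v, accuracies) for v in candidate_votes]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical accuracies: one fairly reliable verifier, two near-chance ones.
accuracies = [0.9, 0.55, 0.55]

# Votes per candidate response, one entry per verifier (+1 accept, -1 reject).
candidate_votes = [
    [+1, -1, -1],  # candidate 0: accepted only by the reliable verifier
    [-1, +1, +1],  # candidate 1: accepted only by the two weak verifiers
]

print("unweighted pick:", select_response(candidate_votes))
print("weighted pick:  ", select_response(candidate_votes, accuracies))
```

In this toy case the unweighted average follows the two near-chance verifiers, while the weighted score follows the single reliable one. That is the intuition behind the claim that weighting matters when verifier accuracies differ.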

Taxonomy

Methods: LLaMA