Shrinking the Generation-Verification Gap with Weak Verifiers
Jon Saad-Falcon, E. Kelly Buchanan, Mayee F. Chen, Tzu-Heng Huang, Brendan McLaughlin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, Christopher Ré

TL;DR
Weaver is a framework that combines multiple weak verifiers into a strong, accurate verifier using weighted ensembles and weak supervision, significantly closing the gap with oracle verifiers in language model evaluation.
Contribution
We introduce Weaver, a novel method for combining weak, imperfect verifiers into a strong verifier using weighted ensembles and weak supervision techniques.
Findings
Weaver achieves 87.7% accuracy in test-time response selection.
It significantly outperforms unweighted verifier combinations.
Distilling Weaver's combined scores into a small cross-encoder reduces the computational cost of verification.
Abstract
Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines outputs into a unified score…
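The difference between weighted and unweighted verifier combinations can be illustrated with a small sketch. This is not Weaver's implementation: it assumes binary verifier verdicts, verifiers that err independently, a uniform prior over correctness, and per-verifier accuracies that are simply given (Weaver estimates them with weak supervision rather than taking them as input). Under those assumptions, the naive-Bayes weight for a verifier is the log-odds of its accuracy. All function and variable names here are hypothetical.

```python
import math


def log_odds_weight(accuracy: float) -> float:
    """Naive-Bayes weight of a binary verifier with the given accuracy.

    Assumes the verifier is equally accurate on correct and incorrect
    responses; the accuracy is clipped to avoid infinite weights.
    """
    eps = 1e-6
    a = min(max(accuracy, eps), 1.0 - eps)
    return math.log(a / (1.0 - a))


def weighted_score(votes, accuracies):
    """Sum of log-odds weights, signed by each verifier's 0/1 verdict."""
    return sum(
        log_odds_weight(a) * (1 if v else -1)
        for v, a in zip(votes, accuracies)
    )


def unweighted_score(votes):
    """Fraction of verifiers that accept the response."""
    return sum(votes) / len(votes)


def select(candidate_votes, score_fn):
    """Index of the candidate response with the highest score."""
    return max(range(len(candidate_votes)), key=lambda i: score_fn(candidate_votes[i]))


# Illustrative (made-up) setup: one fairly reliable verifier, two near-chance ones.
accuracies = [0.90, 0.55, 0.55]

# Verdicts of the three verifiers on two candidate responses.
candidate_votes = [
    [1, 0, 0],  # only the reliable verifier accepts candidate 0
    [0, 1, 1],  # only the two weak verifiers accept candidate 1
]

best_weighted = select(candidate_votes, lambda v: weighted_score(v, accuracies))
best_unweighted = select(candidate_votes, unweighted_score)

print(best_weighted)    # 0: the reliable verifier outweighs the two weak ones
print(best_unweighted)  # 1: majority vote follows the two weak verifiers
```

In this toy case the two selection rules disagree precisely because the verifiers differ in accuracy, which is the situation the abstract identifies as favoring weighted ensembles.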
Code & Models
- hazyresearch/Weaver_Distilled_ModernBERT_Large_for_MMLU-Pro (model, 3 downloads, 1 like)
- hazyresearch/Weaver_Distilled_ModernBERT_Large_for_MATH500 (model)
- hazyresearch/Weaver_Distilled_ModernBERT_Large_for_GPQA (model)
- hazyresearch/Weaver_Distilled_All_Datasets_gte-Qwen2-1.5B-instruct (model, 2 likes)
- hazyresearch/Weaver_Distilled_All_Datasets_ModernBERT-large (model, 2 likes)
- hazyresearch/MATH500_with_Llama_3.1_70B_Instruct_v1 (dataset, 38 downloads)
- hazyresearch/GPQA_with_Llama_3.1_70B_Instruct_v1 (dataset, 73 downloads)
- hazyresearch/MMLU_with_Llama_3.1_70B_Instruct_v1 (dataset, 19 downloads)
- hazyresearch/MMLU-Pro_with_Llama_3.1_70B_Instruct_v1 (dataset, 36 downloads)
- hazyresearch/MATH500_with_Llama_3.1_8B_Instruct_v1 (dataset, 3 downloads)
Taxonomy
Methods: LLaMA
