TL;DR
This paper introduces VerifyBench, a cross-domain benchmark of 4,000 expert-annotated questions for systematically evaluating and comparing verifiers. The results reveal a trade-off between precision and recall across verifier types, and limited robustness to changes in input structure and domain.
Contribution
It presents VerifyBench, the first systematic benchmark for evaluating verifier performance across multiple domains, and provides insights into their strengths, weaknesses, and generalization challenges.
Findings
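Specialized verifiers achieve higher accuracy than general models but lower recall: they reject more responses that actually match the reference.
General models are more inclusive but less precise: they accept more correct responses, and also more incorrect ones.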
Verifiers are highly sensitive to input structure and domain shifts.
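The trade-off in the first two findings can be made concrete with a small sketch. It assumes a verifier emits a boolean "accept" decision per response and that the expert annotation is a boolean "response matches the reference" label. The function name `score_verifier`, the `Metrics` container, and the toy data are hypothetical illustrations, not taken from the paper or its benchmark.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float


def score_verifier(predictions, labels):
    """Score boolean verifier decisions against boolean gold labels.

    predictions[i] is True when the verifier accepts response i.
    labels[i] is True when response i truly matches the reference answer.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")

    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)          # correctly accepted
    fp = sum(1 for p, y in pairs if p and not y)      # wrongly accepted
    fn = sum(1 for p, y in pairs if not p and y)      # wrongly rejected
    tn = sum(1 for p, y in pairs if not p and not y)  # correctly rejected

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when the denominator is empty (no accepts / no positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return Metrics(accuracy, precision, recall)


# Toy data (made up for illustration): 4 correct responses, then 4 incorrect ones.
labels = [True, True, True, True, False, False, False, False]

# A strict verifier: never accepts a wrong response, but misses one correct one.
strict = score_verifier(
    [True, True, True, False, False, False, False, False], labels
)

# A lenient verifier: accepts every correct response, plus two wrong ones.
lenient = score_verifier(
    [True, True, True, True, True, True, False, False], labels
)

print(strict)
print(lenient)
```

On this toy data the strict verifier scores higher on accuracy and precision but lower on recall, while the lenient one recovers every correct response at the cost of precision, which is the pattern the findings describe.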
Abstract
Large language models (LLMs) increasingly rely on reinforcement learning (RL) to enhance their reasoning capabilities through feedback. A critical challenge is verifying the consistency of model-generated responses and reference answers, since these responses are often lengthy, diverse, and nuanced. Rule-based verifiers struggle with complexity, prompting the use of model-based verifiers. However, specialized verifiers lack flexibility, while general LLM judges can be inconsistent. Existing research primarily focuses on building better verifiers, yet a systematic evaluation of different types of verifiers' performance across domains remains lacking, severely constraining the reliable development of Reinforcement Learning with Verifiable Reward (RLVR). To address this, we propose VerifyBench, a comprehensive cross-domain benchmark for systematically evaluating verifiers. We construct…