VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains

Xuzhao Li; Xuchen Li; Shiyu Hu; Yongzhen Guo; Wentao Zhang

arXiv:2507.09884 · cs.AI · July 29, 2025

TL;DR

This paper introduces VerifyBench, a comprehensive cross-domain benchmark with 4,000 expert-annotated questions to systematically evaluate and compare the performance of various verifiers, revealing fundamental trade-offs and limitations.

Contribution

The paper presents VerifyBench, the first systematic benchmark for evaluating verifier performance across multiple domains, and provides insights into verifiers' strengths, weaknesses, and generalization challenges.

Findings

01. Specialized verifiers have higher accuracy but lower recall.

02. General models are more inclusive but less precise.

03. Verifiers are highly sensitive to input structure and domain shifts.
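
To make the accuracy, precision, and recall trade-off in these findings concrete, here is a minimal sketch of how the three metrics could be computed for a binary verifier. It assumes the verifier outputs a true/false judgment per response and that expert labels are also binary; the function name `score_verifier` and the toy judgments are hypothetical illustrations, not the paper's evaluation code or data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class VerifierMetrics:
    accuracy: float
    precision: float
    recall: float


def score_verifier(predicted: Sequence[bool], gold: Sequence[bool]) -> VerifierMetrics:
    """Compare verifier judgments with expert labels (True = response judged correct)."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")

    # Confusion-matrix counts.
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    tn = sum(1 for p, g in zip(predicted, gold) if not p and not g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)

    total = len(gold)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return VerifierMetrics(accuracy, precision, recall)


# Toy, made-up judgments (not results from the paper).
gold = [True, True, False, False, False, False, False, False]
strict_judgments = [True, False, False, False, False, False, False, False]
lenient_judgments = [True, True, True, True, True, False, False, False]

strict = score_verifier(strict_judgments, gold)
lenient = score_verifier(lenient_judgments, gold)
print(strict)
print(lenient)
```

In this toy data the strict verifier accepts few responses, so it rarely accepts a wrong one but misses a correct one (lower recall). The lenient verifier accepts every correct response but also several wrong ones (lower precision).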

Abstract

Large language models (LLMs) increasingly rely on reinforcement learning (RL) to enhance their reasoning capabilities through feedback. A critical challenge is verifying the consistency of model-generated responses and reference answers, since these responses are often lengthy, diverse, and nuanced. Rule-based verifiers struggle with complexity, prompting the use of model-based verifiers. However, specialized verifiers lack flexibility, while general LLM judges can be inconsistent. Existing research primarily focuses on building better verifiers, yet a systematic evaluation of different types of verifiers' performance across domains remains lacking, severely constraining the reliable development of Reinforcement Learning with Verifiable Reward (RLVR). To address this, we propose VerifyBench, a cross-domain comprehensive benchmark for systematically evaluating verifiers. We construct…
