VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Yuchen Yan; Jin Jiang; Zhenbang Ren; Yijun Li; Xudong Cai; Yang Liu; Xin Xu; Mengdi Zhang; Jian Shao; Yongliang Shen; Jun Xiao; Yueting Zhuang

arXiv:2505.15801·cs.CL·February 19, 2026

TL;DR

VerifyBench introduces new benchmarks to evaluate reference-based reward systems in large language models, highlighting current limitations and guiding future improvements in verification accuracy for reasoning tasks.

Contribution

The paper presents VerifyBench and VerifyBench-Hard benchmarks, filling a gap by focusing on verification against ground truth references in RL training of large models.

Findings

01. Larger verifiers perform better on standard cases

02. All systems struggle with challenging instances

03. Benchmark analysis reveals areas for improvement

Abstract

Large reasoning models such as OpenAI o1 and DeepSeek-R1 have demonstrated remarkable performance in complex reasoning tasks. A critical component of their training is the incorporation of reference-based reward systems within reinforcement learning (RL), where model outputs are evaluated against ground truth references. However, existing reward benchmarks focus on preference comparisons between responses rather than evaluating verification against ground truth references, leaving a critical gap in our ability to evaluate verification systems used in reasoning model training. In this paper, we introduce VerifyBench and its challenging variant VerifyBench-Hard, two benchmarks specifically designed to assess reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Our…
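To make the idea of a reference-based reward system concrete, here is a minimal sketch of a rule-based verifier and of how its accuracy could be scored against human labels. It is an illustration under stated assumptions, not the paper's pipeline: the `Instance` fields, the `answer:` extraction convention, the normalization rules, and the demo data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    # Hypothetical record layout; the real benchmark schema may differ.
    question: str
    reference: str    # ground truth answer
    completion: str   # model output to be verified
    label: bool       # human judgment: is the completion correct?


def normalize(answer: str) -> str:
    # Lowercase, remove all whitespace, drop trailing periods.
    return "".join(answer.lower().split()).rstrip(".")


def extract_final_answer(completion: str) -> str:
    # Assumed convention: the final answer follows the last "answer:" marker.
    marker = "answer:"
    idx = completion.lower().rfind(marker)
    return completion[idx + len(marker):] if idx != -1 else completion


def rule_based_verifier(question: str, reference: str, completion: str) -> bool:
    # Binary reward: exact match between normalized answer and reference.
    return normalize(extract_final_answer(completion)) == normalize(reference)


def verifier_accuracy(
    verifier: Callable[[str, str, str], bool], data: List[Instance]
) -> float:
    # Fraction of instances where the verifier agrees with the human label.
    agree = sum(
        verifier(d.question, d.reference, d.completion) == d.label for d in data
    )
    return agree / len(data)


# Made-up demo data, for illustration only.
demo = [
    Instance("What is 2+2?", "4", "We add the numbers. Answer: 4", True),
    Instance("Write one half as a decimal.", "0.5", "Answer: 1/2", True),
    Instance("Capital of France?", "Paris", "Answer: Lyon.", False),
]
acc = verifier_accuracy(rule_based_verifier, demo)
print(f"verifier accuracy: {acc:.2f}")
```

In this toy data the second instance shows where exact matching can fail: `1/2` and `0.5` are equivalent, but the string comparison rejects the completion, so the verifier disagrees with the human label. A model-based verifier could be passed to `verifier_accuracy` in place of `rule_based_verifier` and scored the same way.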

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The paper is well-written and easy to follow.
2. The proposed benchmarks, VerifyBench and VerifyBench-Hard, are well-designed to evaluate the capabilities of reward models in assessing the correctness of reasoning steps in LLM-generated outputs. These benchmarks could be valuable resources for the research community.

Weaknesses

1. The motivation of the paper is somewhat questionable. Typically, the correctness of responses can be directly verified against ground truth answers without the need for reward models. Using a reward model to verify correctness can achieve higher accuracy than directly comparing with ground truth answers, as demonstrated in Figures 12, 13, and 14. However, it is unclear whether the added complexity of using reward models is justified by the performance gain.
2. The benchmark appears t

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The paper is well-written, smooth, and easy to follow, with clear structure and presentation.
- The data curation pipelines are comprehensive and well-designed.
- The benchmark tackles a challenge that is not addressed in existing benchmarks.

Weaknesses

- The final source distribution of completions may be biased toward specific models, which could undermine benchmark diversity. Further analysis of this distribution would address the concern.
- Additional clarification of the data curation pipeline would further improve the clarity of the paper.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The benchmark provides a validated evaluation framework that enables quantitative assessment of verifiers.
2. The test setting is diverse and challenging, and it is able to distinguish even among SOTA models.
3. The experiments empirically demonstrate a correlation between higher benchmark scores and better training performance.

Weaknesses

1. The paper does not appear to include examples of difficult samples. Showing more complex and challenging instances from the benchmark would be more convincing.
2. The benchmark is mainly focused on mathematics, so it may not reflect performance on verifiable tasks in broader domains, such as evaluating physics expressions or chemical equations.

Code & Models

Models

Datasets

ZJU-REAL/VerifyBench
dataset· 126 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques