From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning

Yuzhen Huang; Weihao Zeng; Xingshan Zeng; Qi Zhu; Junxian He

arXiv:2505.22203·cs.LG·October 8, 2025

Open Access 4 Reviews

TL;DR

This paper analyzes the reliability of rule- and model-based verifiers in mathematical reasoning within reinforcement learning, revealing their limitations and vulnerabilities, and highlighting the need for more robust verification methods.

Contribution

It provides a comprehensive comparison of rule- and model-based verifiers, exposing their weaknesses and susceptibility to manipulation in mathematical reasoning tasks.

Findings

01

Rule-based verifiers often fail to recognize equivalent answers, causing false negatives.

02

Model-based verifiers achieve higher static accuracy but are vulnerable to hacking.

03

Both verifier types have limitations that impact RL training effectiveness.
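The false-negative problem in the first finding can be illustrated with a small sketch. It is not one of the verifiers studied in the paper: the function names (`strict_verifier`, `lenient_verifier`, `false_negative_rate`) and the toy answer pairs are hypothetical, chosen only to show how a format-sensitive rule rejects answers that are mathematically equivalent to the reference.

```python
from fractions import Fraction

# Hypothetical toy data: every prediction is mathematically equal to the reference.
EQUIVALENT_PAIRS = [("1/2", "0.5"), ("50%", "0.5"), ("0.50", "0.5"), ("0.5", "0.5")]


def strict_verifier(pred: str, gold: str) -> bool:
    """Rule-based check: exact string match after trimming whitespace."""
    return pred.strip() == gold.strip()


def to_fraction(text: str):
    """Parse simple numeric strings such as '0.5', '1/2', or '50%'.

    Returns None when the text is not a plain number.
    """
    s = text.strip().replace(" ", "")
    try:
        if s.endswith("%"):
            return Fraction(s[:-1]) / 100
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def lenient_verifier(pred: str, gold: str) -> bool:
    """Compare exact numeric values when both sides parse; else fall back to string match."""
    p, g = to_fraction(pred), to_fraction(gold)
    if p is not None and g is not None:
        return p == g
    return strict_verifier(pred, gold)


def false_negative_rate(verifier, pairs) -> float:
    """Fraction of truly correct answers that the verifier rejects."""
    rejected = sum(not verifier(pred, gold) for pred, gold in pairs)
    return rejected / len(pairs)
```

On this toy set the exact-match rule rejects three of the four correct answers, while the normalizing rule rejects none. Real rule-based verifiers handle far more formats than this sketch, but the failure mode is the same: any equivalent form the rules do not anticipate becomes a false negative.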

Abstract

Trustworthy verifiers are essential for the success of reinforcement learning with verifiable reward (RLVR), which is the core methodology behind various large reasoning models such as DeepSeek-R1. In complex domains like mathematical reasoning, rule-based verifiers have been widely adopted in previous works to train strong reasoning models. However, the reliability of these verifiers and their impact on the RL training process remain poorly understood. In this work, we take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across multiple commonly used mathematical datasets, resulting in non-negligible false negative rates. This limitation adversely…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The paper is clearly written and easy to follow, with a logical structure and clear presentation of results.
2. It addresses an important and timely question about the reliability of verifiers in RL-based fine-tuning.
3. The work provides a detailed and systematic statistical analysis comparing rule-based and model-based verifiers across multiple benchmarks.

Weaknesses

1. The paper is mostly empirical and lacks a formal analysis of why RL dynamics amplify verifier brittleness.
2. The study focuses almost exclusively on mathematical reasoning; generalization to other domains is barely discussed.
3. Reported gains in RL experiments are small and may not exceed noise given limited sampling. Statistical uncertainty isn’t reported.
4. The paper lacks a clear concluding message or actionable suggestion. While it identifies the limitations of both rule-based and model

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- **Originality.** The paper provides a systematic and timely investigation of verifier design in RL with verifiable rewards (RLVR), offering one of the first comprehensive analyses of how verifier accuracy impacts training stability and model performance, and exposing limitations in current verification systems.
- **Comprehensive experimentation.** The study conducts extensive experiments comparing rule-based and model-based verifiers, builds dedicated diagnostic datasets, and performs multiple

Weaknesses

- **[Significance]** While the paper presents systematic experiments and insightful analyses, many of its findings confirm known issues rather than reveal fundamentally new phenomena. Specifically, (1) the false-negative problem of rule-based verifiers has been discussed in prior work on mathematical expression evaluation (e.g., [1], [2]); and (2) the vulnerability of LLM-based verifiers to reward hacking aligns with broader findings on LLM-as-a-judge robustness and reward hacking in RLHF (e

Reviewer 03 · Rating 2 · Confidence 4

Strengths

* The paper attempts to clarify and analyze key issues often overlooked in RLHF and RLVR research, particularly the limitations of rule-based versus model-based verifiers. This focus addresses an important and timely problem in the field.
* The discussion on potential reward hacking and robustness issues arising from the use of model-based verifiers is interesting, providing new perspectives on challenges that are often underexplored in current research.

Weaknesses

**[W1] Insufficient Analysis** The paper's main motivation is that the impact of verifier types on RLVR is poorly understood, yet it lacks in-depth analysis on this topic. There is no error case analysis explaining why rule- and model-based verifiers fail, nor any examination of how these failures influence policy behavior. Without these critical components, the paper lacks the insights necessary to address its core motivation.

**[W2] Low Readability** The overall organization of the pa

Reviewer 04 · Rating 4 · Confidence 4

Strengths

1. Verifier reliability is practically and conceptually important in RL.
2. The static and dynamic analyses span multiple open-source verifiers and datasets, revealing concrete recall-precision trade-offs.
3. The paper goes beyond accuracy metrics, exposing vulnerabilities of fine-tuned verifiers and proposing reward hacks.

Weaknesses

1. In Figure 1, the differences among rule-based, verifier-based, and oracle reward curves are relatively minor. Table 2 further shows that the hybrid or model-based verifiers yield only about +2 points over the baseline. It is unclear whether such modest gains justify the additional computational and implementation overhead of integrating verifiers into the RL loop.
2. The curves labeled as verifier-hacked and non-hacked in Figure 1 are almost overlapping except at the very last step. This mak

Taxonomy

Topics: AI-based Problem Solving and Planning