Reinforcing General Reasoning without Verifiers

Xiangxin Zhou; Zichen Liu; Anya Sims; Haonan Wang; Tianyu Pang; Chongxuan Li; Liang Wang; Min Lin; Chao Du

arXiv:2505.21493·cs.LG·May 28, 2025

TL;DR

This paper introduces VeriFree, a verifier-free reinforcement learning method that directly maximizes the probability of reference answers, enabling general reasoning in large language models without the need for rule-based or model-based verifiers.
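As a rough illustration of what "maximizing the probability of reference answers" can mean in practice, the sketch below scores a reasoning trace by the probability a model assigns to the reference answer tokens that follow it. The function name `reference_answer_reward` and the per-token log-probabilities are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math

def reference_answer_reward(ref_token_logprobs):
    """Probability assigned to the full reference answer.

    ref_token_logprobs: per-token log-probabilities of the reference answer,
    each conditioned on the question, the reasoning trace, and the preceding
    answer tokens (assumed to come from the policy model itself).
    """
    return math.exp(sum(ref_token_logprobs))

# Made-up log-probs of a 3-token reference answer under two reasoning traces.
good_trace = [math.log(0.9), math.log(0.8), math.log(0.95)]
bad_trace = [math.log(0.2), math.log(0.5), math.log(0.3)]

reward_good = reference_answer_reward(good_trace)
reward_bad = reference_answer_reward(bad_trace)
print(reward_good, reward_bad)
```

Under this reading, a trace that makes the reference answer likely receives a reward near 1 and needs no external verifier call, since the score comes from the policy's own token probabilities.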

Contribution

VeriFree extends RL training for large language models to general reasoning tasks by removing the need for verifiers, reducing compute costs and maintaining or improving performance.

Findings

01

VeriFree matches or surpasses verifier-based methods on multiple benchmarks.

02

It significantly reduces computational and practical burdens compared to verifier-based approaches.

03

The method can be viewed as unified training of the policy and an implicit verifier, and also as a variational optimization approach.

Abstract

The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper is very well-written and well-structured.
- The concept of verifier-free RL eliminates reliance on rule-based or LLM verifiers, making RL training scalable to open-ended reasoning domains. Further, it is derived directly from the RL objective, with proven equivalence to RLVR under certain assumptions and formal variance reduction.
- Strong empirical results: Matches or surpasses verifier-based methods across multiple benchmarks while being simpler, faster, and less memory-intensive. T

Weaknesses

- In Tables 1 and 2, it is unclear why the accuracy of Qwen3-4B-Base-Verifier is lower than that of Qwen3-4B-Base-VeriFree. In VeriFree, the LLM output is first parsed into reasoning tokens and a generated final answer (y). Then, the generated answer is replaced with the gold-standard final answer (y⁎). If only one answer (y⁎) is correct and receives a reward of 1 (while all others receive 0), then the expected reward for a reasoning trace z can be computed directly as the probability assigned t
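The identity the reviewer describes can be written out as a short derivation. The notation is an assumption made here for illustration: $x$ is the question, $z$ the reasoning trace, $y$ a sampled answer, $y^\star$ the reference answer, $\pi_\theta$ the policy, and $p_z := \pi_\theta(y^\star \mid x, z)$. It holds only under the single-correct-answer assumption stated in the review.

```latex
\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, z)}\big[\mathbf{1}\{y = y^\star\}\big]
  = \sum_{y} \pi_\theta(y \mid x, z)\,\mathbf{1}\{y = y^\star\}
  = \pi_\theta(y^\star \mid x, z) = p_z ,
\qquad
\operatorname{Var}_{z,\,y}\big(\mathbf{1}\{y = y^\star\}\big)
  = \mathbb{E}_{z}\big[p_z (1 - p_z)\big] + \operatorname{Var}_{z}(p_z)
  \;\ge\; \operatorname{Var}_{z}(p_z).
```

The second line is the law of total variance: using $p_z$ directly as the reward removes the sampling noise over $y$, which is one way to read the variance-reduction claim.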

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The proposed VeriFree framework is simple to implement and bypasses the need for verifiers.
2. Experimental results demonstrate that VeriFree achieves results comparable to the verifier-based baseline and shows good gains in transferable reasoning skills.

Weaknesses

1. Evaluation reliability: Single-run pass@1 results are too noisy for small benchmarks such as MATH-500 and Minerva. The authors might have to report Avg@k and other statistics for a robust demonstration.
2. Missing baselines: Many existing probability- or frequency-based baselines are missing, such as TTRL.
3. The assumption that only a single accurate answer exists might be problematic due to the flexibility of language (e.g., 'A is larger than B' and 'B is the smaller one'). The author might need
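For readers unfamiliar with the Avg@k statistic mentioned in point 1, a minimal sketch follows. It assumes the common reading of Avg@k as accuracy averaged over k independently sampled responses per problem; the function name `avg_at_k` and the correctness data are made up for illustration and do not come from the paper.

```python
from statistics import mean, stdev

def avg_at_k(correct):
    """correct[i][j] is True if the j-th sampled response to problem i was right.

    Returns the accuracy averaged over the k runs and the sample standard
    deviation of per-run accuracy (0.0 when k == 1).
    """
    k = len(correct[0])
    # Accuracy of each run j, computed across all problems.
    per_run = [mean(1.0 if row[j] else 0.0 for row in correct) for j in range(k)]
    spread = stdev(per_run) if k > 1 else 0.0
    return mean(per_run), spread

# Made-up results: 4 problems, 3 sampled responses each.
results = [
    [True, True, False],
    [False, False, False],
    [True, False, True],
    [True, True, True],
]
avg, spread = avg_at_k(results)
print(f"Avg@3 = {avg:.3f}, std across runs = {spread:.3f}")
```

Reporting the spread alongside the mean shows how much a single-run pass@1 number could move from one sampling run to the next.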

Reviewer 03 · Rating 2 · Confidence 3

Strengths

(+) The paper presents an interesting approach to bypass the need for verification for more general reasoning tasks that might not be rule-based.
(+) The claims are supported by analysis (for one specific setting; also see Weaknesses below) and experimental evaluations.
(+) The performance of VeriFree is comparable to verifier-based methods on a large variety of tasks, providing a promise of improved computational efficiency resulting from eliminating reliance on a strong verifier.
(+) Se

Weaknesses

(-) The authors claim that VeriFree is the first verifier-free methodology for this class of problems. It is not clear whether benchmarking only against verifier-based methods is adequate.
(-) What exactly does 'rule-based answer verification' mean? It is reasonable to conjecture that some of the domains mentioned might support rule-based answer verification (e.g., based on statutes of laws). This question becomes even more relevant since the authors state in the Limitations (in the Appendix) that Ver

Code & Models

Repositories

sail-sg/verifree
PyTorch · Official

Taxonomy

Topics · Logic, Reasoning, and Knowledge