Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Xumeng Wen; Zihan Liu; Shun Zheng; Shengyu Ye; Zhirong Wu; Yang Wang; Zhijian Xu; Xiao Liang; Junjie Li; Ziming Miao; Jiang Bian; Mao Yang

arXiv:2506.14245·cs.AI·October 3, 2025

TL;DR

This paper investigates how Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning in large language models. Using a new evaluation metric and a theoretical analysis, it shows that RLVR encourages correct reasoning and extends the models' reasoning boundary.

Contribution

It introduces the CoT-Pass@K metric and a theoretical framework explaining RLVR's incentive mechanism for improving reasoning in LLMs.

Findings

1. RLVR extends reasoning boundaries in mathematical and coding tasks.
2. The new CoT-Pass@K metric effectively measures reasoning success.
3. RLVR incentivizes correct reasoning early in training.

Abstract

Recent advancements in long chain-of-thought (CoT) reasoning, particularly through the Group Relative Policy Optimization algorithm used by DeepSeek-R1, have led to significant interest in the potential of Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). While RLVR promises to improve reasoning by allowing models to learn from free exploration, there remains debate over whether it truly enhances reasoning abilities or simply boosts sampling efficiency. This paper systematically investigates the impact of RLVR on LLM reasoning. We revisit Pass@K experiments and demonstrate that RLVR can extend the reasoning boundary for both mathematical and coding tasks. This is supported by our introduction of a novel evaluation metric, CoT-Pass@K, which captures reasoning success by accounting for both the final answer and intermediate reasoning steps.…
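The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes the widely used unbiased Pass@K estimator over n samples per problem, and it assumes that CoT-Pass@K applies the same estimator while counting a sample as a success only when both its final answer and its chain of thought are judged correct. The function names and the boolean-list input format are hypothetical and are not taken from the paper's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate: probability that at least one of k
    samples drawn without replacement from n is a success, given that
    c of the n samples are successes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, so the
    # estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def cot_pass_at_k(answer_ok: list[bool], cot_ok: list[bool], k: int) -> float:
    """Assumed CoT-Pass@K: same estimator, but a sample counts only if
    the final answer AND the reasoning trace are both judged correct.
    How cot_ok is produced (for example by an LLM verifier) is outside
    this sketch."""
    if len(answer_ok) != len(cot_ok):
        raise ValueError("answer_ok and cot_ok must have the same length")
    n = len(answer_ok)
    c = sum(a and r for a, r in zip(answer_ok, cot_ok))
    return pass_at_k(n, c, k)


if __name__ == "__main__":
    # Four samples: two reach the right answer, only one with sound reasoning.
    answers = [True, True, False, False]
    cots = [True, False, False, False]
    print(pass_at_k(len(answers), sum(answers), 1))
    print(cot_pass_at_k(answers, cots, 1))
```

Under these assumptions CoT-Pass@K can never exceed Pass@K for the same samples, because every sample it counts is also counted by Pass@K.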

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

1. The paper provides a well-structured theoretical analysis explaining how GRPO gradients implicitly favor logically consistent reasoning traces, offering insights on why RLVR often improves reasoning without explicit step-wise rewards.
2. Experiments on multiple reasoning domains (math and code) and models (Qwen, Nemotron) show consistent gains under both Pass@K and the proposed CoT-Pass@K metrics. The analyses of training dynamics and reasoning correctness lend strong empirical support to th

Weaknesses

1. The reliability of CoT-Pass@K hinges on judgments from a single verifier (DeepSeek-R1-Qwen3-8B). While multi-verification is discussed, evaluating under different verifiers would improve confidence in these results. How robust are the CoT-Pass@K results when different verifiers are used? One option is to keep the verifier used in training fixed and vary only the verifier used for evaluation.
2. Potentially questionable assumptions. The theoretical analysis assumes that reasoning traces with correct answers have

Reviewer 02·Rating 4·Confidence 3

Strengths

1. The paper reads smoothly and presents its ideas in a clear, easy-to-follow style.
2. The experimental design is comprehensive and well thought-out.
3. RLVR provides only marginal gains at Pass@1024 or Pass@2048, but demonstrably boosts Pass@1, a widely recognized pattern. This supports the prevailing view that RLVR achieves its effect by learning correct reasoning, a claim for which your experiments offer some evidence.

Weaknesses

Please refer to the "Questions" section for details.

Reviewer 03·Rating 6·Confidence 4

Strengths

- The exact role of RLVR in enhancing the reasoning capabilities of base models is an open and fundamental problem for the RL community. Thus, this manuscript begins with a very strong motivation and is, in my view, highly timely.
- Although the proposed Theorem 1 is relatively simple and straightforward, this preliminary explanation is helpful for the community to understand the underlying reasons why RLVR works.
- The authors base their conclusions on extensive experiments. I particularly

Weaknesses

- The correctness of a CoT is determined by an R1-distilled-Qwen3-8B model, which may not be a sufficiently strong verifier. Since reasoning models frequently exhibit rethinking or backtracking behavior, I have concerns about whether an 8B model is reliable enough to verify complex reasoning traces that include self-correction, reflection, and backtracking. For example, what if a model initially reasons incorrectly but then corrects its answer on its own? Is the distilled 8B model strong enoug

Code & Models

Datasets

XumengWen/AIME24-25_CoT_Verification
dataset· 80 dl

Taxonomy

Topics·Reinforcement Learning in Robotics

Methods·Balanced Selection