RLPR: Extrapolating RLVR to General Domains without Verifiers

Tianyu Yu; Bo Ji; Shouli Wang; Shu Yao; Zefan Wang; Ganqu Cui; Lifan Yuan; Ning Ding; Yuan Yao; Zhiyuan Liu; Maosong Sun; Tat-Seng Chua

arXiv:2506.18254·cs.LG·June 24, 2025

TL;DR

RLPR introduces a verifier-free method leveraging LLMs' intrinsic token probabilities to improve reasoning across diverse domains, overcoming the limitations of domain-specific verifiers.

Contribution

It proposes a novel verifier-free framework that uses LLM token probabilities as reward signals, enabling reasoning extrapolation to general domains without domain-specific verifiers.

Findings

01

RLPR outperforms verifier-dependent methods on multiple benchmarks.

02

The approach enhances reasoning capabilities in both mathematical and general domains.

03

Stable reward estimation from intrinsic probabilities is key to RLPR's success.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This limitation stems primarily from a heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address this challenge, our key observation is that an LLM's intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM's own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find…
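The reward idea described in the abstract can be illustrated with a minimal sketch. It assumes the per-token log-probabilities of the reference answer have already been obtained by scoring the reference under the policy, conditioned on the prompt and a sampled reasoning trace. The function name `probability_reward`, its `aggregate` parameter, and the two aggregation choices are hypothetical illustrations; the abstract does not specify how token probabilities are combined, so this should not be read as the paper's exact formulation.

```python
import math
from typing import Sequence


def probability_reward(ref_token_logprobs: Sequence[float],
                       aggregate: str = "mean") -> float:
    """Turn reference-answer token log-probs into a scalar reward in [0, 1].

    ref_token_logprobs: log-probabilities the policy assigns to each token of
    the reference answer, given the prompt and the sampled reasoning.
    aggregate: hypothetical switch between two illustrative aggregations.
    """
    if not ref_token_logprobs:
        raise ValueError("reference answer must contain at least one token")
    if aggregate == "mean":
        # Average per-token probability (an assumed choice, not from the abstract).
        probs = [math.exp(lp) for lp in ref_token_logprobs]
        return sum(probs) / len(probs)
    if aggregate == "product":
        # Full sequence likelihood; a single unlikely token drives it toward 0.
        return math.exp(sum(ref_token_logprobs))
    raise ValueError(f"unknown aggregate: {aggregate!r}")


# Toy log-probs standing in for two reasoning traces on the same question.
helpful_trace = [math.log(0.9), math.log(0.8)]    # reference answer looks likely
unhelpful_trace = [math.log(0.2), math.log(0.1)]  # reference answer looks unlikely

reward_helpful = probability_reward(helpful_trace)
reward_unhelpful = probability_reward(unhelpful_trace)
```

In this toy setup, the trace that makes the reference answer more probable receives the higher reward, which is the signal a verifier-free training loop would then maximize.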

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- The paper evaluates RLPR across multiple tasks using multiple model families.
- The paper is well-written and easy to follow.

Weaknesses

- Equation 2 seems flawed. Consider two responses, "I'm good, not bad" and "I'm bad, not good". Although they are semantically different, Equation 2 seems incapable of distinguishing them in reward. In other words, the token probability reward does not evaluate correctness and is not semantically reliable.
- Why are the latent factors additively decomposable? The authors could provide more intuition behind this.
- The choice of Avg@k seems inconsistent. I understand this is due to the budget concern. Bu

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. This paper addresses an important research question of RL for general domains. The verifier-free approach is a meaningful step toward broadening RLVR to free-form natural language domains, where rule-based or model-based verifiers are impractical or costly.
2. The results are impressive on paper, with RLPR outperforming strong baselines. It also boosts math performance without math-specific data, suggesting transferability.

Weaknesses

1. Using token-level probabilities as a reward signal is conceptually close to likelihood-based or entropy-based self-reward methods (e.g., VeriFree).
2. The paper presents a series of empirical engineering techniques (essentially a "bag of tricks") without providing rigorous theoretical justifications.
3. Depending on the LLM's intrinsic probabilities as a reward signal introduces circularity and uncertainty. If the base model is biased, or overconfident, this could amplify errors by reinforcing

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. It introduces a new verifier-free RL method leveraging model probability as intrinsic feedback. Overall, it is an interesting idea to use the model's own confidence to evaluate the CoT and the final answer when assigning reward. I can see it being very promising for fields without easily verifiable answers. The paper also demonstrates empirically the improvements on general datasets without easily verifiable answers.
2. The experiments are comprehensive enough to cover both general and m

Weaknesses

Presentation clarity: Section 2.2 should have a clearer description. For example, use $z$ and $y$ to denote the response rather than just $o$; then you could use the notation $z_1, …, z_n, y_1, …, y_m$. The modified sequence would then be clearer and would align with the previous notation. Otherwise it is confusing, and the reader might regard the entire sequence as modified. In addition, the $f_{seq}$ average in line 167 should have a clearer definition rather than abusing the notation. Clar
