RLPR: Extrapolating RLVR to General Domains without Verifiers
Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, Tat-Seng Chua

TL;DR
RLPR introduces a verifier-free method leveraging LLMs' intrinsic token probabilities to improve reasoning across diverse domains, overcoming the limitations of domain-specific verifiers.
Contribution
The paper proposes a verifier-free framework that uses LLM token probabilities as reward signals, extending RLVR-style reasoning training to general domains where no domain-specific verifier is available.
Findings
RLPR outperforms verifier-dependent methods on multiple benchmarks.
The approach enhances reasoning capabilities in both mathematical and general domains.
Stable reward estimation from intrinsic probabilities is key to RLPR's success.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This limitation stems primarily from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address this challenge, our key observation is that an LLM's intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM's own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find…
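The reward signal described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the per-token log-probabilities of the reference answer (conditioned on the prompt and a sampled reasoning trace) have already been read out from the policy model, and it assumes the mean of token probabilities as the aggregation, which is only one plausible choice. The function names and the example numbers are hypothetical.

```python
import math
from typing import Sequence


def mean_token_probability(ref_logprobs: Sequence[float]) -> float:
    """Average per-token probability of the reference-answer tokens.

    Assumption: ref_logprobs[i] is the log-probability the model assigns to
    the i-th reference-answer token, given the prompt and its own reasoning.
    """
    if not ref_logprobs:
        raise ValueError("reference answer must contain at least one token")
    return sum(math.exp(lp) for lp in ref_logprobs) / len(ref_logprobs)


def sequence_likelihood(ref_logprobs: Sequence[float]) -> float:
    """Product of token probabilities, shown only for comparison.

    A single low-probability token drives this value toward zero, which is
    one reason an averaged score may give a steadier reward.
    """
    return math.exp(sum(ref_logprobs))


# Hypothetical log-probs for a three-token reference answer.
with_reasoning = [math.log(0.9), math.log(0.8), math.log(0.7)]

reward = mean_token_probability(with_reasoning)
likelihood = sequence_likelihood(with_reasoning)
print(f"mean-probability reward: {reward:.3f}")
print(f"sequence likelihood:     {likelihood:.3f}")
```

In a training loop, a score like `reward` would take the place of the verifier's binary correctness signal for each sampled response.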
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper evaluates RLPR across multiple tasks using multiple model families.
- The paper is well-written and easy to follow.
- Equation 2 seems flawed. Consider two responses "I'm good, not bad" and "I'm bad, not good". Although they are semantically different, Equation 2 seems incapable of distinguishing them in reward. In other words, the token probability reward does not evaluate correctness and is not semantically reliable.
- Why are the latent factors additively decomposable? The authors could provide more intuition behind this.
- The choice of Avg@k seems inconsistent. I understand this is due to the budget concern. Bu
1. This paper addresses an important research question of RL for general domains. The verifier-free approach is a meaningful step toward broadening RLVR to free-form natural language domains, where rule-based or model-based verifiers are impractical or costly.
2. The results are impressive on paper, with RLPR outperforming strong baselines. It also boosts math performance without math-specific data, suggesting transferability.
1. Using token-level probabilities as a reward signal is conceptually close to likelihood-based or entropy-based self-reward methods (e.g., VeriFree).
2. The paper presents a series of empirical engineering techniques (essentially a "bag of tricks") without providing rigorous theoretical justifications.
3. Depending on the LLM's intrinsic probabilities as a reward signal introduces circularity and uncertainty. If the base model is biased, or overconfident, this could amplify errors by reinforcing
1. It introduces a new verifier-free RL method leveraging model probability as intrinsic feedback. Overall, it is an interesting idea to use the model's own confidence to evaluate the CoT and the final answer when assigning reward. I can see it being very promising for fields without easily verifiable answers. The paper also demonstrates empirical improvements on general-domain datasets that lack easily verifiable answers.
2. The experiments are comprehensive enough to cover both general and m
Presentation clarity:
Section 2.2 should have a clearer description. For example, use $z$ and $y$ to denote the parts of the response rather than just $o$, so that the notation $z_1, …, z_n, y_1, …, y_m$ becomes available. The modified sequence would then be clearer and would align with the earlier notation. Otherwise it is confusing, and readers might take the entire sequence to be modified. In addition, the $f_{seq}$ average at line 167 should have a clearer definition rather than abusing the notation. Clar
