How Far Can Unsupervised RLVR Scale LLM Training?
Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Xiusi Chen, Youbang Sun, Xingtai Lv, Xuekai Zhu, Li Sheng, Ran Li, Huan-ang Gao, Yuchen Zhang, Bowen Zhou, Zhiyuan Liu, Ning Ding

TL;DR
This paper analyzes the potential and limitations of unsupervised reinforcement learning with verifiable rewards (URLVR) for scaling large language model training, revealing theoretical insights, experimental patterns, and future directions.
Contribution
It provides a comprehensive taxonomy, a unified theoretical framework, and experimental validation of intrinsic versus external URLVR methods, highlighting their convergence behavior and scalability limits.
Findings
Intrinsic rewards follow a rise-then-fall pattern during training.
Model prior influences collapse timing more than engineering choices.
External rewards grounded in computational asymmetries may overcome intrinsic limitations.
Abstract
Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground truth labels. Recent works leverage model intrinsic signals, showing promising early gains, yet their potential and limitations remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory and extensive experiments. We first classify URLVR methods into intrinsic versus external based on reward sources, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when misaligned. Through systematic experiments, we show intrinsic rewards consistently follow a rise-then-fall…
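The sharpening behavior described in the abstract can be illustrated with a toy sketch. This is not the paper's framework or any of the methods it studies: it assumes a policy over three candidate answers, a label-free reward equal to the policy's own probability, r(y) = π(y), and an exact (expected) policy-gradient update on the logits. The names `softmax`, `sharpen`, `steps`, and `lr` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sharpen(logits, steps=200, lr=1.0):
    """Expected policy-gradient ascent on an assumed intrinsic reward r(y) = pi(y).

    The reward is treated as a constant at each step (no gradient flows
    through it), and no ground-truth label is used anywhere.
    """
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        rewards = probs  # assumption: confidence itself is the reward
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # d/dz_k of E_{y~pi}[r(y)] with r held fixed is pi_k * (r_k - baseline).
        logits = [z + lr * p * (r - baseline)
                  for z, p, r in zip(logits, probs, rewards)]
    return softmax(logits)


correct = 0  # index of the (hidden) correct answer

# Case 1: the correct answer starts as the most likely one.
aligned = sharpen([1.0, 0.5, 0.0])
# Case 2: a wrong answer starts as the most likely one.
misaligned = sharpen([0.5, 1.0, 0.0])

print("P(correct), aligned prior:   ", aligned[correct])
print("P(correct), misaligned prior:", misaligned[correct])
```

In this sketch the update can only amplify whichever answer the initial distribution already favors, so the probability of the correct answer grows in the first case and shrinks in the second, even though the training procedure is identical.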
Peer Reviews
Decision·ICLR 2026 Poster
Proposes to view the changes to the policy through the lens of trading uncertainty for confidence and provides a systematic, empirical analysis of the distinct failure modes of different reward types, such as length collapse for probability rewards and verbosity for entropy rewards. A practical strength of this paper is its investigation of scaling limits, which concludes that while large-scale training leads to collapse, these methods are stable and effective in small, domain-specific settings,
A closely related analysis was published recently (Jun 2025) by Y. Zhang et al. (No Free Lunch: Rethinking Internal Feedback for LLM Reasoning). Arguably, the core insights of these two publications are very similar: Y. Zhang et al. provide a stronger grounding in theory and a more in-depth analysis of learning dynamics for different base models, while this paper provides more insight into the dynamics and failure modes of the different reward signals. Overall, the authors cite this p
The paper presents a consolidated mathematical perspective that connects different intrinsic reward methods under a single framework, which is very nice. The authors also present a solid evaluation of intrinsic rewards: they analysed the different failure modes of different methods. I think the paper is very insightful and clearly written, and I think it's very interesting to investigate these novelty-driven rewards within RL for LLMs.
This is not necessarily a weakness, but the paper really focuses on the analysis of these different intrinsic rewards and does not contribute a new method on top of this. While I personally really enjoyed reading this paper, I am not 100% sure this is the right venue for it.
1. The topic is timely, given the increase in popularity of methods using RL with intrinsic rewards for improving the capabilities of language models. 2. The empirical analysis sheds light on the difference between intrinsic rewards and their potential failure patterns. 3. I find the suggestion of using intrinsic reward dynamics in the initial time steps as an indicator for the potential success of URLVR, as opposed to using measures such as pass@k that require access to a verifier or labels,
1. The unified reward framework in Section 3.1 is currently not well-defined and its usefulness is not substantiated in the paper. Specifically, how does the right hand side of Equation (1) depend on $y$? Should the cross-entropy term $h$ be $-q^i (y | x) \ln \pi_\theta^i (y | x)$? Moreover, the significance of such a unified perspective greatly depends on whether it allows characterizing similarities and differences between different instances. However, the unified framework is not really used
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Machine Learning and Data Classification
