When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective
Zelin Zhang, Fei Cheng, Chenhui Chu

TL;DR
This paper investigates when and why unsupervised reinforcement learning (RL) improves mathematical reasoning in language models. It designs intrinsic rewards, tests base models with differing reasoning capabilities, and uses geometric diagnostics to explain stability and failure modes.
Contribution
It designs intrinsic rewards that enforce concise and certain generation, examines how a base model's foundational logical prior determines success or failure, and introduces a geometric diagnostic framework to explain why some configurations stabilize while others collapse.
Findings
Intrinsic rewards improve reasoning performance.
Model success depends on foundational logical priors.
Geometric diagnostics reveal stability boundaries.
Abstract
Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic instability, such as policy collapse and reward hacking. In this paper, we first design and evaluate a suite of intrinsic rewards that explicitly enforce concise and certain generation. Second, to discover the boundaries of this approach, we test base models across a spectrum of intrinsic reasoning capabilities, revealing how a model's foundational logical prior dictates its success or failure. Finally, to demystify why certain configurations stabilize while others collapse, we introduce a novel…
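The abstract describes intrinsic rewards that favor concise and certain generation but, in this excerpt, does not give their form. The following is only a minimal sketch of what such a reward could look like, under stated assumptions: certainty is measured as negative mean token entropy, conciseness as a linear length penalty, and the names `intrinsic_reward`, `max_len`, and `length_weight` are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def intrinsic_reward(
    step_probs: Sequence[Sequence[float]],
    max_len: int = 512,          # hypothetical length budget
    length_weight: float = 0.5,  # hypothetical trade-off coefficient
) -> float:
    """Illustrative label-free reward: higher for certain, short outputs.

    step_probs holds one probability distribution per generated token.
    This is an assumed form, not the reward used in the paper.
    """
    if not step_probs:
        return 0.0
    # Certainty term: low average entropy means confident generation.
    mean_entropy = sum(token_entropy(p) for p in step_probs) / len(step_probs)
    certainty = -mean_entropy
    # Conciseness term: fraction of the length budget consumed, capped at 1.
    length_penalty = min(len(step_probs), max_len) / max_len
    return certainty - length_weight * length_penalty


# Toy usage: a confident 4-token output versus an uncertain one.
confident = [[1.0, 0.0]] * 4
uncertain = [[0.5, 0.5]] * 4
print(intrinsic_reward(confident) > intrinsic_reward(uncertain))
```

A reward of this shape needs no ground-truth answer, which is the scalability appeal the abstract points to. It also makes the reward-hacking risk visible: a policy can raise it by emitting short, overconfident text regardless of correctness.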
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Reinforcement Learning in Robotics
