When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective

Zelin Zhang; Fei Cheng; Chenhui Chu

arXiv:2603.16578 · cs.LG · March 18, 2026

TL;DR

This paper investigates when and why unsupervised reinforcement learning improves mathematical reasoning in language models. It introduces intrinsic rewards, tests base models across a range of reasoning capabilities, and uses geometric diagnostics to explain stability and failure modes.

Contribution

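The paper designs intrinsic rewards that enforce concise and certain generation, examines how a base model's foundational logical prior determines success or failure, and introduces a geometric diagnostic framework to explain why some configurations stabilize while others collapse.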

Findings

01. Intrinsic rewards improve reasoning performance.

02. Model success depends on foundational logical priors.

03. Geometric diagnostics reveal stability boundaries.

Abstract

Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic instability, such as policy collapse and reward hacking. In this paper, we first design and evaluate a suite of intrinsic rewards that explicitly enforce concise and certain generation. Second, to discover the boundaries of this approach, we test base models across a spectrum of intrinsic reasoning capabilities, revealing how a model's foundational logical prior dictates its success or failure. Finally, to demystify why certain configurations stabilize while others collapse, we introduce a novel…
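To make the idea of an intrinsic reward for "concise and certain generation" concrete, here is a minimal sketch. It is not the reward suite from the paper, whose exact definitions are not given in this abstract. It assumes certainty is measured by the mean Shannon entropy of the per-token output distributions and conciseness by a capped length penalty; the names `mean_token_entropy`, `intrinsic_reward`, `max_len`, and `length_weight` are hypothetical.

```python
import math
from typing import List


def mean_token_entropy(token_dists: List[List[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    if not token_dists:
        return 0.0
    total = 0.0
    for dist in token_dists:
        # Zero-probability entries contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)


def intrinsic_reward(
    token_dists: List[List[float]],
    max_len: int = 512,          # hypothetical length budget
    length_weight: float = 0.5,  # hypothetical trade-off coefficient
) -> float:
    """Illustrative label-free reward: higher for confident, shorter responses.

    Assumption: certainty = negative mean token entropy,
    conciseness = penalty proportional to length, capped at max_len.
    """
    certainty = -mean_token_entropy(token_dists)
    length_penalty = length_weight * min(len(token_dists) / max_len, 1.0)
    return certainty - length_penalty


# A short, confident response versus a longer, uncertain one.
confident_short = [[1.0, 0.0]] * 4
uncertain_long = [[0.5, 0.5]] * 8
print(intrinsic_reward(confident_short) > intrinsic_reward(uncertain_long))
```

A reward of this shape needs no ground-truth answers, which is the scalability argument made above. It also shows how reward hacking can arise: a policy can raise the score by becoming short and overconfident without becoming correct.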

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Reinforcement Learning in Robotics