Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner

Tao He; Rongchuan Mu; Lizi Liao; Yixin Cao; Ming Liu; and Bing Qin

arXiv:2507.23317 · cs.LG · August 1, 2025



TL;DR

This paper introduces a thought-level process reward mechanism for large reasoning models. It improves training efficiency and accuracy on math problem solving by replacing sparse, outcome-only feedback with denser process rewards.

Contribution

The paper proposes an intrinsic signal-driven generative process evaluation mechanism and a capability-adaptive reward system, and integrates both into a new RL algorithm, TP-GRPO, for training LRMs.

Findings

1. Achieves higher accuracy with fewer training samples.
2. Reduces reward hacking and improves credit assignment.
3. Demonstrates effectiveness on 1.5B and 7B parameter models.

Abstract

Large reasoning models (LRMs) have recently shown promise in solving complex math problems when optimized with Reinforcement Learning (RL). However, conventional approaches rely on outcome-only rewards that provide sparse feedback, resulting in an inefficient optimization process. In this work, we investigate the function of process reward models (PRMs) to accelerate RL training for LRMs. We propose a novel intrinsic signal-driven generative process evaluation mechanism operating at the thought level to address major bottlenecks in RL-based training. Specifically, instead of requiring PRMs to know how to solve problems, our method uses intrinsic signals in solutions to judge stepwise correctness and aggregate contiguous correct/incorrect steps into coherent 'thought' units. These structured, thought-level rewards enable more reliable credit assignment by reducing ambiguity in step…
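The aggregation of contiguous correct/incorrect steps into 'thought' units can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function name `group_steps_into_thoughts` is hypothetical, and the list of booleans is an assumed stand-in for the per-step judgments that the generative PRM would produce.

```python
from itertools import groupby
from typing import List, Tuple


def group_steps_into_thoughts(step_correct: List[bool]) -> List[Tuple[bool, int, int]]:
    """Merge contiguous steps with the same judgment into thought spans.

    Each span is (label, start, end), where end is exclusive.
    The boolean judgments are assumed inputs, not produced here.
    """
    thoughts = []
    start = 0
    for label, run in groupby(step_correct):
        length = sum(1 for _ in run)  # number of steps in this run
        thoughts.append((label, start, start + length))
        start += length
    return thoughts


# Hypothetical judgments for a six-step solution.
judgments = [True, True, False, False, False, True]
print(group_steps_into_thoughts(judgments))
# -> [(True, 0, 2), (False, 2, 5), (True, 5, 6)]
```

Six step-level labels collapse into three thought-level units here, so a reward attached to each unit would cover a coherent span rather than a single, possibly ambiguous step.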


Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning