Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner
Tao He, Rongchuan Mu, Lizi Liao, Yixin Cao, Ming Liu, and Bing Qin

TL;DR
This paper introduces a thought-level process reward mechanism for large reasoning models that improves training efficiency and accuracy on math problem solving by mitigating the sparse-reward problem of outcome-only RL.
Contribution
It proposes a novel intrinsic signal-driven generative evaluation mechanism and a capability-adaptive reward system, integrated into a new RL algorithm, TP-GRPO, to enhance LRM training.
Findings
Achieves higher accuracy with fewer training samples.
Reduces reward hacking and improves credit assignment.
Demonstrates effectiveness on 1.5B and 7B parameter models.
Abstract
Large reasoning models (LRMs) have recently shown promise in solving complex math problems when optimized with Reinforcement Learning (RL). However, conventional approaches rely on outcome-only rewards that provide sparse feedback, resulting in an inefficient optimization process. In this work, we investigate the role of process reward models (PRMs) in accelerating RL training for LRMs. We propose a novel intrinsic signal-driven generative process evaluation mechanism operating at the thought level to address major bottlenecks in RL-based training. Specifically, instead of requiring PRMs to know how to solve problems, our method uses intrinsic signals in solutions to judge stepwise correctness and aggregate contiguous correct/incorrect steps into coherent 'thought' units. These structured, thought-level rewards enable more reliable credit assignment by reducing ambiguity in step…
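The aggregation of contiguous correct/incorrect steps into 'thought' units described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each step already carries a boolean correctness label (how the generative PRM produces those labels is not covered here), and the names `Thought` and `group_steps_into_thoughts` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Thought:
    """Hypothetical container for one thought unit (not from the paper)."""
    start: int     # index of the first step in the unit
    end: int       # index of the last step in the unit (inclusive)
    correct: bool  # correctness label shared by every step in the unit


def group_steps_into_thoughts(step_labels: List[bool]) -> List[Thought]:
    """Merge runs of consecutive steps with the same label into units.

    Assumption: step_labels[i] is True if step i was judged correct.
    """
    thoughts: List[Thought] = []
    index = 0
    # groupby yields one group per maximal run of equal adjacent labels.
    for label, run in groupby(step_labels):
        length = len(list(run))
        thoughts.append(Thought(index, index + length - 1, label))
        index += length
    return thoughts


# Example: two correct steps, three incorrect steps, one correct step.
labels = [True, True, False, False, False, True]
thoughts = group_steps_into_thoughts(labels)
for thought in thoughts:
    print(thought)
```

With six step labels the sketch yields three units, so a reward could be attached to three thoughts rather than six individual steps, which is the kind of coarser credit assignment the abstract refers to.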
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
