ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Jiaru Zou; Ling Yang; Jingwen Gu; Jiahao Qiu; Ke Shen; Jingrui He; Mengdi Wang

arXiv:2506.18896 · cs.CL · September 26, 2025

TL;DR

ReasonFlux-PRM is a trajectory-aware process reward model (PRM) that scores the reasoning process of large language models using both step-level and trajectory-level supervision, leading to better training-data selection and stronger performance on reasoning tasks.

Contribution

This work introduces ReasonFlux-PRM, a novel trajectory-aware PRM that evaluates trajectory-response reasoning traces and supports reward supervision in both offline and online settings, improving model training and inference.

Findings

01. ReasonFlux-PRM-7B selects higher-quality training data than strong PRMs and human-curated baselines.

02. It achieves an average gain of 12.1% in supervised fine-tuning.

03. It provides average gains of 4.5% in reinforcement learning and 6.3% in test-time scaling.

Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like DeepSeek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for…
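To make the offline data-selection idea concrete, here is a minimal sketch of how step-level and trajectory-level scores might be blended into one reward and used to keep the best traces. It is not the paper's formulation: the `ScoredTrace` container, the mean aggregation over steps, the weight `alpha`, and the `select_top_k` helper are all hypothetical names and choices made for illustration, and the scores are made-up numbers rather than PRM outputs.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScoredTrace:
    """One trajectory-response sample with hypothetical scores in [0, 1]."""
    sample_id: str
    step_scores: list[float]   # assumed: one score per intermediate thinking step
    trajectory_score: float    # assumed: one score for the trace as a whole


def combined_reward(trace: ScoredTrace, alpha: float = 0.5) -> float:
    """Blend the mean step score with the trajectory score.

    The linear weighting by `alpha` is an illustrative assumption,
    not the aggregation used by ReasonFlux-PRM.
    """
    if not trace.step_scores:
        raise ValueError("trace has no step scores")
    return alpha * mean(trace.step_scores) + (1.0 - alpha) * trace.trajectory_score


def select_top_k(traces: list[ScoredTrace], k: int, alpha: float = 0.5) -> list[ScoredTrace]:
    """Keep the k traces with the highest combined reward."""
    ranked = sorted(traces, key=lambda t: combined_reward(t, alpha), reverse=True)
    return ranked[:k]


# Toy data: scores are invented for the example.
traces = [
    ScoredTrace("a", [0.9, 0.8, 0.7], 0.6),
    ScoredTrace("b", [0.4, 0.5, 0.3], 0.9),
    ScoredTrace("c", [0.2, 0.1, 0.3], 0.2),
]
selected = select_top_k(traces, k=2)
print([t.sample_id for t in selected])
```

In this toy setup, trace "b" has weak individual steps but a strong overall score, so whether it survives selection depends on `alpha`; that trade-off is the reason a trajectory-aware model would look at both signals rather than only one.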
